[Pacemaker] cLVM stuck

Karl Rößmann K.Roessmann at fkf.mpg.de
Fri Feb 10 02:56:20 EST 2012


Quoting Andreas Kurz <andreas at hastexo.com>:

> Hello,
>
> On 02/09/2012 03:29 PM, Karl Rößmann wrote:
>> Hi all,
>>
>> we run a three-node HA cluster using cLVM and Xen.
>>
>> After installing some online updates node by node
>
> While the cluster was in maintenance-mode, or when the cluster was shut down
> on the node that received the updates?
>

the cluster was shut down on the node that received the updates.
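(i.e. on each node in turn the cluster stack was stopped, the updates were
installed, and the stack was started again. As a rough sketch, with the
usual SLES commands assumed here rather than the exact ones used:

    rcopenais stop     # stop the cluster stack (Pacemaker/corosync) on this node
    zypper update      # install the online updates (or via YaST online update)
    rcopenais start    # start the cluster stack again and let the node rejoin
)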

>> the cLVM is stuck, and the last (updated) node does not want to join
>> the cLVM.
>> Two nodes are still running.
>> The Xen VMs are running (they have their disks on cLVs),
>> but commands like 'lvdisplay' do not work.
>>
>> Is there a way to recover the cLVM without restarting the whole cluster?
>
> Any log entries about the controld on orion1 ... what is the output of
> "crm_mon -1fr"? ... looks like there is a problem with starting
> dlm_controld.pcmk ... and I wonder why orion1 is not fenced on stop
> errors, or did that happen?

Yes, orion1 was fenced.
controld entries on orion1:
Feb  9 10:31:56 orion1 cluster-dlm: log_config: dlm:controld conf 2 0 1 memb 16885952 117549248 join left 33663168
Feb  9 10:31:56 orion1 cluster-dlm: log_config: dlm:controld conf 2 0 1 memb 16885952 117549248 join left 33663168
Feb  9 10:42:23 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 33663168 left
Feb  9 10:46:04 orion1 cluster-dlm: log_config: dlm:controld conf 2 0 1 memb 16885952 33663168 join left 117549248
Feb  9 10:46:04 orion1 cluster-dlm: log_config: dlm:controld conf 2 0 1 memb 16885952 33663168 join left 117549248
Feb  9 10:54:25 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 117549248 left
Feb  9 11:11:23 orion1 lrmd: [6031]: info: RA output: (dlm:0:probe:stderr) dlm_controld.pcmk: no process found
Feb  9 11:11:30 orion1 lrmd: [6031]: info: RA output: (dlm:0:start:stderr) dlm_controld.pcmk: no process found
Feb  9 11:11:30 orion1 cluster-dlm[6347]: main: dlm_controld master started
Feb  9 11:11:30 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 16885952 left
Feb  9 12:02:39 orion1 lrmd: [6459]: info: RA output: (dlm:0:probe:stderr) dlm_controld.pcmk: no process found
Feb  9 12:02:41 orion1 lrmd: [6459]: info: RA output: (dlm:0:start:stderr) dlm_controld.pcmk: no process found
Feb  9 12:02:41 orion1 cluster-dlm[6773]: main: dlm_controld master started
Feb  9 12:02:41 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 16885952 left
Feb  9 13:35:07 orion1 lrmd: [14347]: info: RA output: (dlm:0:probe:stderr) dlm_controld.pcmk: no process found
Feb  9 13:35:09 orion1 lrmd: [14347]: info: RA output: (dlm:0:start:stderr) dlm_controld.pcmk: no process found
Feb  9 13:35:09 orion1 cluster-dlm[14773]: main: dlm_controld master started
Feb  9 13:35:09 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 16885952 left



output from crm_mon -1fr:
============
Last updated: Fri Feb 10 08:41:53 2012
Stack: openais
Current DC: orion2 - partition with quorum
Version: 1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60
3 Nodes configured, 3 expected votes
17 Resources configured.
============

Online: [ orion2 orion7 ]
OFFLINE: [ orion1 ]

Full list of resources:

  Clone Set: dlm_clone [dlm]
      Started: [ orion2 orion7 ]
      Stopped: [ dlm:0 ]
  Clone Set: clvm_clone [clvm]
      Started: [ orion2 orion7 ]
      Stopped: [ clvm:0 ]
  sbd_stonith    (stonith:external/sbd): Started orion2
  Clone Set: cluvg1_clone [cluvg1]
      Started: [ orion2 orion7 ]
      Stopped: [ cluvg1:2 ]
  styx   (ocf::heartbeat:Xen):   Started orion7
  shib   (ocf::heartbeat:Xen):   Started orion7
  wiki   (ocf::heartbeat:Xen):   Started orion2
  horde  (ocf::heartbeat:Xen):   Started orion7
  www    (ocf::heartbeat:Xen):   Started orion7
  enventory      (ocf::heartbeat:Xen):   Started orion2
  mailrelay      (ocf::heartbeat:Xen):   Started orion2

Migration summary:
* Node orion2:
* Node orion7:


>
> Did you inspect the output of "dlm_tool ls/dump" on all nodes where the
> controld is running?
>

output of dlm_tool ls on orion2:

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 3,3
members       16885952 33663168 117549248
new change    member 2 joined 0 remove 1 failed 1 seq 4,10
new status    wait_messages 1 wait_condition 0
new members   33663168 117549248

output of dlm_tool ls on orion7:

dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 1,1
members       16885952 33663168 117549248
new change    member 2 joined 0 remove 1 failed 1 seq 2,8
new status    wait_messages 1 wait_condition 0
new members   33663168 117549248

output of dlm_tool dump on orion2:
...
1328790910 clvmd wait_messages cg 9 need 2 of 3
1328792190 Processing membership 572
1328792190 Skipped active node 33663168: born-on=528, last-seen=572, this-event=572, last-event=568
1328792190 Skipped active node 117549248: born-on=536, last-seen=572, this-event=572, last-event=568
1328792190 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/16885952"
1328792190 Removed inactive node 16885952: born-on=568, last-seen=568, this-event=572, last-event=568
1328792190 dlm:controld conf 2 0 1 memb 33663168 117549248 join left 16885952
1328792190 dlm:ls:clvmd conf 2 0 1 memb 33663168 117549248 join left 16885952
1328792190 clvmd add_change cg 10 remove nodeid 16885952 reason 3
1328792190 clvmd add_change cg 10 counts member 2 joined 0 remove 1 failed 1
1328792190 Node 16885952/orion1 has not been shot yet
1328792190 clvmd check_fencing 16885952 not fenced add 1328790910 fence 0
1328792540 Node 16885952/orion1 was last shot 'now'
1328792540 clvmd check_fencing 16885952 1328790910 fenced at 1328792540
1328792540 clvmd check_fencing done
1328792540 clvmd check_quorum disabled
1328792540 clvmd check_fs none registered
1328792540 clvmd send nack old cg 8 new cg 10
1328792540 clvmd send_start cg 8 flags 6 data2 0 counts 3 2 0 1 1
1328792540 clvmd send nack old cg 6 new cg 10
1328792540 clvmd send_start cg 6 flags 6 data2 0 counts 3 2 0 1 1
1328792540 clvmd send nack old cg 4 new cg 10
1328792540 clvmd send_start cg 4 flags 6 data2 0 counts 3 2 0 1 1
1328792540 clvmd send_start cg 10 flags 2 data2 0 counts 3 2 0 1 1
1328792540 clvmd receive_start 33663168:8 len 80
1328792540 clvmd match_change 33663168:8 skip 4 already start
1328792540 clvmd match_change 33663168:8 skip 5 already start
1328792540 clvmd match_change 33663168:8 skip 6 is nacked
1328792540 clvmd match_change 33663168:8 skip 7 is nacked
1328792540 clvmd match_change 33663168:8 skip 8 is nacked
1328792540 clvmd match_change 33663168:8 skip 9 is nacked
1328792540 clvmd match_change 33663168:8 matches cg 10
1328792540 clvmd receive_start 33663168:8 is NACK
1328792540 clvmd wait_messages cg 10 need 2 of 2
...

output of dlm_tool dump on orion7:
...
1328792190 Processing membership 572
1328792190 Skipped active node 33663168: born-on=528, last-seen=572, this-event=572, last-event=568
1328792190 Skipped active node 117549248: born-on=536, last-seen=572, this-event=572, last-event=568
1328792190 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/16885952"
1328792190 Removed inactive node 16885952: born-on=568, last-seen=568, this-event=572, last-event=568
1328792190 dlm:controld conf 2 0 1 memb 33663168 117549248 join left 16885952
1328792190 dlm:ls:clvmd conf 2 0 1 memb 33663168 117549248 join left 16885952
1328792190 clvmd add_change cg 8 remove nodeid 16885952 reason 3
1328792190 clvmd add_change cg 8 counts member 2 joined 0 remove 1 failed 1
1328792190 Node 16885952/orion1 has not been shot yet
1328792190 clvmd check_fencing 16885952 not fenced add 1328790910 fence 0
1328792540 clvmd receive_start 33663168:8 len 80
1328792540 clvmd match_change 33663168:8 skip 2 already start
1328792540 clvmd match_change 33663168:8 skip 3 already start
1328792540 clvmd match_change 33663168:8 skip 4 is nacked
1328792540 clvmd match_change 33663168:8 skip 5 is nacked
1328792540 clvmd match_change 33663168:8 skip 6 is nacked
1328792540 clvmd match_change 33663168:8 skip 7 is nacked
1328792540 clvmd match_change 33663168:8 matches cg 8
1328792540 clvmd receive_start 33663168:8 is NACK
1328792540 Node 16885952/orion1 was last shot 'now'
1328792540 clvmd check_fencing 16885952 1328790910 fenced at 1328792540
1328792540 clvmd check_fencing done
1328792540 clvmd check_quorum disabled
1328792540 clvmd check_fs none registered
1328792540 clvmd send nack old cg 6 new cg 8
1328792540 clvmd send_start cg 6 flags 6 data2 0 counts 1 2 0 1 1
1328792540 clvmd send nack old cg 4 new cg 8
1328792540 clvmd send_start cg 4 flags 6 data2 0 counts 1 2 0 1 1
1328792540 clvmd send nack old cg 2 new cg 8
1328792540 clvmd send_start cg 2 flags 6 data2 0 counts 1 2 0 1 1
1328792540 clvmd send_start cg 8 flags 2 data2 0 counts 1 2 0 1 1
1328792540 clvmd receive_start 33663168:6 len 80
1328792540 clvmd match_change 33663168:6 skip 2 already start
1328792540 clvmd match_change 33663168:6 skip 3 already start
1328792540 clvmd match_change 33663168:6 skip 4 is nacked
1328792540 clvmd match_change 33663168:6 skip 5 is nacked
1328792540 clvmd match_change 33663168:6 skip 6 is nacked
1328792540 clvmd match_change 33663168:6 skip 7 is nacked
1328792540 clvmd match_change 33663168:6 skip 8 is nacked
1328792540 clvmd find_change 33663168:6 no match
1328792540 clvmd wait_messages cg 8 need 2 of 2
1328792540 clvmd receive_start 33663168:4 len 80
1328792540 clvmd match_change 33663168:4 skip 2 already start
1328792540 clvmd match_change 33663168:4 skip 3 already start
...
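
So both remaining nodes report the clvmd lockspace in kern_stop, still in
wait_messages for the new configuration after orion1 left. If it helps, I
can also post the per-node member/fencing view of the lockspace, e.g.
(assuming this dlm_tool version supports the -n option):

    dlm_tool ls -n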


> Your crm_mon output shows orion1 offline ... that does not seem to line up
> in time with your logs?
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>
>>
>> we have the latest
>> SUSE SLES SP1 and HA Extension, including
>> pacemaker-1.1.5-5.9.11.1
>> corosync-1.3.3-0.3.1
>>
>> Some ERROR messages:
>> Feb  9 12:06:42 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
>> operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
>> Feb  9 12:13:41 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
>> operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
>> Feb  9 12:16:21 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
>> operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
>> Feb  9 13:39:10 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
>> operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
>> Feb  9 13:53:38 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
>> operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
>> Feb  9 13:56:18 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
>> operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
>>
>>
>>
>> Feb  9 12:11:55 orion2 crm_resource: [13025]: ERROR:
>> resource_ipc_timeout: No messages received in 60 seconds
>> Feb  9 12:13:41 orion2 crmd: [5882]: ERROR: send_msg_via_ipc: Unknown
>> Sub-system (13025_crm_resource)... discarding message.
>> Feb  9 12:14:41 orion2 crmd: [5882]: ERROR: print_elem: Aborting
>> transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0,
>> loc: orion1, priority: 0)
>> Feb  9 13:54:38 orion2 crmd: [5882]: ERROR: print_elem: Aborting
>> transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0,
>> loc: orion1, priority: 0)
>>
>>
>>
>> Some additional information:
>>
>> crm_mon -1:
>> ============
>> Last updated: Thu Feb  9 15:10:34 2012
>> Stack: openais
>> Current DC: orion2 - partition with quorum
>> Version: 1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60
>> 3 Nodes configured, 3 expected votes
>> 17 Resources configured.
>> ============
>>
>> Online: [ orion2 orion7 ]
>> OFFLINE: [ orion1 ]
>>
>>  Clone Set: dlm_clone [dlm]
>>      Started: [ orion2 orion7 ]
>>      Stopped: [ dlm:0 ]
>>  Clone Set: clvm_clone [clvm]
>>      Started: [ orion2 orion7 ]
>>      Stopped: [ clvm:0 ]
>>  sbd_stonith    (stonith:external/sbd): Started orion2
>>  Clone Set: cluvg1_clone [cluvg1]
>>      Started: [ orion2 orion7 ]
>>      Stopped: [ cluvg1:2 ]
>>  styx   (ocf::heartbeat:Xen):   Started orion7
>>  shib   (ocf::heartbeat:Xen):   Started orion7
>>  wiki   (ocf::heartbeat:Xen):   Started orion2
>>  horde  (ocf::heartbeat:Xen):   Started orion7
>>  www    (ocf::heartbeat:Xen):   Started orion7
>>  enventory      (ocf::heartbeat:Xen):   Started orion2
>>  mailrelay      (ocf::heartbeat:Xen):   Started orion2
>>
>>
>>
>> crm configure show
>> node orion1 \
>>         attributes standby="off"
>> node orion2 \
>>         attributes standby="off"
>> node orion7 \
>>         attributes standby="off"
>> primitive cluvg1 ocf:heartbeat:LVM \
>>         params volgrpname="cluvg1" \
>>         op start interval="0" timeout="240s" \
>>         op stop interval="0" timeout="100s" \
>>         meta target-role="Started"
>> primitive clvm ocf:lvm2:clvmd \
>>         params daemon_timeout="30" \
>>         op start interval="0" timeout="240s" \
>>         op stop interval="0" timeout="100s" \
>>         meta target-role="Started"
>> primitive dlm ocf:pacemaker:controld \
>>         op monitor interval="120s" \
>>         op start interval="0" timeout="240s" \
>>         op stop interval="0" timeout="100s" \
>>         meta target-role="Started"
>> primitive enventory ocf:heartbeat:Xen \
>>         meta target-role="Started" allow-migrate="true" \
>>         operations $id="enventory-operations" \
>>         op monitor interval="10" timeout="30" \
>>         op migrate_from interval="0" timeout="600" \
>>         op migrate_to interval="0" timeout="600" \
>>         params xmfile="/etc/xen/vm/enventory" shutdown_timeout="60"
>> primitive horde ocf:heartbeat:Xen \
>>         meta target-role="Started" is-managed="true" allow-migrate="true" \
>>         operations $id="horde-operations" \
>>         op monitor interval="10" timeout="30" \
>>         op migrate_from interval="0" timeout="600" \
>>         op migrate_to interval="0" timeout="600" \
>>         params xmfile="/etc/xen/vm/horde" shutdown_timeout="120"
>> primitive sbd_stonith stonith:external/sbd \
>>         params sbd_device="/dev/disk/by-id/scsi-360080e50001c150e0000019e4df6d4d5-part1" \
>>         meta target-role="started"
>> ...
>> ...
>> ...
>> clone cluvg1_clone cluvg1 \
>>         meta interleave="true" target-role="started" is-managed="true"
>> clone clvm_clone clvm \
>>         meta globally-unique="false" interleave="true" target-role="started"
>> clone dlm_clone dlm \
>>         meta globally-unique="false" interleave="true" target-role="started"
>> colocation cluvg1_with_clvm inf: cluvg1_clone clvm_clone
>> colocation clvm_with_dlm inf: clvm_clone dlm_clone
>> colocation enventory_with_cluvg1 inf: enventory cluvg1_clone
>> colocation horde_with_cluvg1 inf: horde cluvg1_clone
>> ...
>> ... more Xen VMs
>> ...
>> order cluvg1_before_enventory inf: cluvg1_clone enventory
>> order cluvg1_before_horde inf: cluvg1_clone horde
>> order clvm_before_cluvg1 inf: clvm_clone cluvg1_clone
>> order dlm_before_clvm inf: dlm_clone clvm_clone
>> property $id="cib-bootstrap-options" \
>>         dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
>>         cluster-infrastructure="openais" \
>>         expected-quorum-votes="3" \
>>         stonith-timeout="420s" \
>>         last-lrm-refresh="1328792018"
>> rsc_defaults $id="rsc_defaults-options" \
>>         resource-stickiness="10"
>> op_defaults $id="op_defaults-options" \
>>         record-pending="false"
>>
>>
>
>
>



-- 
Karl Rößmann				Tel. +49-711-689-1657
Max-Planck-Institut FKF       		Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart				email K.Roessmann at fkf.mpg.de



