[Pacemaker] HA across WDM Fibre link - Nodes won't rejoin after reboot

Mon Apr 2 09:53:53 EDT 2012

Hi everyone.

I have two nodes running on ESX hosts in two geographically separate
data centres. The link between them is DWDM fibre, which is the only
thing I can think of that could be causing this.

Both nodes run SLES 11 SP1 with HAE, with all the latest updates applied.

If Corosync is set to multicast on the default address, the Corosync
instances on the two nodes do not communicate at all. If I use
broadcast, they communicate and the nodes join.
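
One way to check whether multicast crosses the link at all, independently
of Corosync, is a small UDP probe run on both nodes. The following is only
a sketch in Python: the group 226.94.1.1 is an assumption (substitute the
mcastaddr from corosync.conf), port 5499 is an arbitrary test port, and the
function names are made up for illustration.

```python
import socket

GROUP = "226.94.1.1"  # assumption: replace with the mcastaddr in corosync.conf
PORT = 5499           # assumption: arbitrary unused UDP port, not Corosync's own


def build_mreq(group, local_ip):
    """Pack the 8-byte ip_mreq structure expected by IP_ADD_MEMBERSHIP."""
    return socket.inet_aton(group) + socket.inet_aton(local_ip)


def listen(local_ip, group=GROUP, port=PORT, timeout=10.0):
    """Join the group on local_ip and wait for one datagram.

    Returns (sender_ip, payload), or None if nothing arrives in time.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        build_mreq(group, local_ip))
        sock.settimeout(timeout)
        data, sender = sock.recvfrom(1024)
        return sender[0], data
    except socket.timeout:
        return None
    finally:
        sock.close()


def send(local_ip, payload=b"mcast-probe", group=GROUP, port=PORT, ttl=1):
    """Send one datagram to the group out of the interface owning local_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        # ttl=1 keeps the probe on the local subnet; raise it if the link is routed.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                        socket.inet_aton(local_ip))
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

Calling listen("10.160.12.20") on node 2 and then send("10.160.12.10") on
node 1 would exercise the same path; a None result from listen would
suggest that multicast is not being forwarded between the two sites.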

If I reboot node 2, it rejoins fine. If I reboot node 1, it sits in the
pending state for a while and then drops to offline. I can then clear
the config out again and let the nodes rejoin. Node 1 always seems to be
the DC.

Pending - logs from node 1, loops every second:

-02: id=336371722 state=member (new) addr=r(0) ip(10.160.12.20)  votes=1
born=7912 seen=7920 proc=00000000000000000000000000151312

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: crm_update_peer: Node
PPS-VMAIL-01: id=168599562 state=member (new) addr=r(0) ip(10.160.12.10)
(new) votes=1 (new) born=7920 seen=7920
proc=00000000000000000000000000151312 (new)

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: WARN: do_log: FSA: Input
I_SHUTDOWN from revision_check_callback() received in state S_STARTING

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_state_transition:
State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN
cause=C_FSA_INTERNAL origin=revision_check_callback ]

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_lrm_control:
Disconnected from the LRM

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_ha_control:
Disconnected from OpenAIS

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_cib_control:
Disconnecting CIB

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_exit: Performing
A_EXIT_0 - gracefully exiting the CRMd

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: free_mem: Dropping
I_NULL: [ state=S_STOPPING cause=C_FSA_INTERNAL
origin=register_fsa_error_adv ]

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: free_mem: Dropping
I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]

Apr  2 14:37:13 PPS-VMAIL-01 crmd: [3896]: info: do_exit: [crmd] stopped
(0)
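
For reading the crm_update_peer entries above: the numeric id on each line
matches the addr= value when the id is read as the IPv4 address stored as
a little-endian 32-bit integer. That this is how the ids were generated
here is an assumption (it is consistent with both ids in the log); the
helper name below is hypothetical. A short Python sketch:

```python
import socket
import struct


def nodeid_to_ip(nodeid):
    """Decode a node id into dotted-quad form.

    Assumption: the id is the four address bytes read as a
    little-endian unsigned 32-bit integer.
    """
    return socket.inet_ntoa(struct.pack("<I", nodeid))


print(nodeid_to_ip(336371722))  # 10.160.12.20
print(nodeid_to_ip(168599562))  # 10.160.12.10
```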

Offline - logs from node 1, loops every second:

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: cib_replace_notify:
Local-only Replace: 0.0.0 from PP2-VMAIL-02

Apr  2 14:38:06 PPS-VMAIL-01 attrd: [3512]: info: do_cib_replaced:
Sending full refresh

Apr  2 14:38:06 PPS-VMAIL-01 attrd: [3512]: info: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (<null>)

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: apply_xml_diff: Digest
mis-match: expected 0cf389141d344ca552679f9924d281c5, calculated
818a100a0e3b725068393624381c9d4f

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: notice: cib_process_diff: Diff
0.13.642 -> 0.0.0 not applied to 0.13.642: Failed application of an
update diff

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: info: cib_server_process_diff:
Requesting re-sync from peer

Apr  2 14:38:06 PPS-VMAIL-01 cib: [3510]: WARN: cib_diff_notify:
Local-only Change (client:attrd, call: 1221): 0.0.0 (Application of an
update diff failed, requesting a full refresh)
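
The version strings in these entries (0.13.642 and 0.0.0) are CIB versions
in the admin_epoch.epoch.num_updates form, which compare field by field.
A small Python sketch of that comparison, using a hypothetical helper
name, shows that the 0.0.0 replace is lower than the local 0.13.642. It
says nothing about the digest mismatch itself.

```python
def parse_cib_version(text):
    """Split 'admin_epoch.epoch.num_updates' into a tuple of ints."""
    admin_epoch, epoch, num_updates = (int(part) for part in text.split("."))
    return (admin_epoch, epoch, num_updates)


local = parse_cib_version("0.13.642")   # version held by node 1 in the log
incoming = parse_cib_version("0.0.0")   # version of the replace in the log

# Tuples compare left to right, the same order the three fields are ranked in.
print(incoming < local)  # True
```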

Offline - logs from node 2, loops every second:

Apr  2 14:39:05 PP2-VMAIL-02 corosync[3794]:  [TOTEM ] Retransmit List:
29b7 29b8 29b9

Apr  2 14:39:05 PP2-VMAIL-02 corosync[3794]:  [TOTEM ] Retransmit List:
29bb 29bc

Apr  2 14:39:05 PP2-VMAIL-02 cib: [3801]: info: cib_process_request:
Operation complete: op cib_sync_one for section 'all'
(origin=PPS-VMAIL-01/PPS-VMAIL-01/(null), version=0.13.1538): ok (rc=0)

Any ideas please?

Thanks.

Darren Mansell
