[Pacemaker] Node in pending state, resources duplicated and data corruption
Gabriel Gomiz
ggomiz at cooperativaobrera.coop
Fri Mar 14 11:32:18 UTC 2014
Hi to all!
We have a 4-node cluster and recently experienced a weird issue with Pacemaker that resulted in three
database instance resources being duplicated (running simultaneously on 2 nodes) and subsequent data
corruption.
I've been investigating the logs and could not reach a conclusion as to why it happened, so I'm writing
to the list with the details of the event to see if someone can help me pinpoint whether there was a
problem in our operation or whether we hit some bug.
OS: CentOS 6.4
Pacemaker version: 1.1.8
Stack: cman
Stonith enabled via DELL iDRAC in all 4 nodes
Nodes: gandalf, isildur, mordor, lorien
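For reference, each node has its own fence device (st-<node>) pointing at its iDRAC. The definitions
look roughly like this (a sketch reconstructed from memory -- the actual fence agent and parameter
values may differ):

    primitive st-gandalf stonith:fence_ipmilan \
            params ipaddr="<gandalf-idrac-ip>" login="<user>" passwd="<secret>" lanplus="1" \
                   pcmk_host_list="gandalf.san01.cooperativaobrera.coop" \
            op monitor interval="3600"
    # plus a location constraint so a node never runs its own fence device
    location l-st-gandalf st-gandalf -inf: gandalf.san01.cooperativaobrera.coop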
Timeline of events and logs:
- A resource monitor operation times out and the resources on that node (gandalf) are stopped
Mar 8 08:41:09 gandalf crmd[31561]: error: process_lrm_event: LRM operation
vg_ifx_oltp_monitor_240000 (594) Timed Out (timeout=120000ms)
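The resource in question is an LVM volume group, presumably managed with the ocf:heartbeat:LVM agent;
its definition looks roughly like this (sketch -- the volgrpname value is approximate, the monitor
interval/timeout match the CIB diff further below):

    primitive vg_ifx_oltp ocf:heartbeat:LVM \
            params volgrpname="vg_ifx_oltp" \
            op monitor interval="240" timeout="120"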
- Stopping the resources on that node (gandalf) also times out and the node is killed by stonith from
another node (mordor)
Mar 8 08:42:54 gandalf lrmd[31558]: warning: child_timeout_callback: vg_ifx_oltp_stop_0 process
(PID 17816) timed out
....
Mar 8 08:42:55 gandalf pengine[31560]: warning: unpack_rsc_op: Processing failed op stop for
vg_ifx_oltp on gandalf.san01.cooperativaobrera.coop: unknown error (1)
Mar 8 08:42:55 gandalf pengine[31560]: warning: pe_fence_node: Node
gandalf.san01.cooperativaobrera.coop will be fenced to recover from resource failure(s)
Mar 8 08:43:09 mordor corosync[25977]: [TOTEM ] A processor failed, forming new configuration.
Mar 8 08:43:09 mordor stonith-ng[26212]: notice: log_operation: Operation 'reboot' [4612] (call 0
from crmd.31561) for host 'gandalf.san01.cooperativaobrera.coop' with device 'st-gandalf' returned:
0 (OK)
Mar 8 08:43:21 mordor corosync[25977]: [QUORUM] Members[3]: 2 3 4
Mar 8 08:43:21 mordor corosync[25977]: [TOTEM ] A processor joined or left the membership and a
new membership was formed.
Mar 8 08:43:21 mordor crmd[26216]: notice: crm_update_peer_state: cman_event_callback: Node
gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 8 08:43:21 mordor crmd[26216]: warning: check_dead_member: Our DC node
(gandalf.san01.cooperativaobrera.coop) left the cluster
Mar 8 08:43:21 mordor kernel: dlm: closing connection to node 1
Mar 8 08:43:21 mordor corosync[25977]: [CPG ] chosen downlist: sender r(0) ip(172.16.1.1) r(1)
ip(172.16.2.1) ; members(old:4 left:1)
Mar 8 08:43:21 mordor corosync[25977]: [MAIN ] Completed service synchronization, ready to
provide service.
Mar 8 08:43:21 mordor fenced[26045]: fencing deferred to isildur.san01.cooperativaobrera.coop
Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_NOT_DC ->
S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
Mar 8 08:43:22 mordor crmd[26216]: notice: do_state_transition: State transition S_ELECTION ->
S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Mar 8 08:43:22 mordor stonith-ng[26212]: notice: remote_op_done: Operation reboot of
gandalf.san01.cooperativaobrera.coop by mordor.san01.cooperativaobrera.coop for
crmd.31561 at gandalf.san01.cooperativaobrera.coop.10d27664: OK
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Peer
gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by
mordor.san01.cooperativaobrera.coop for gandalf.san01.cooperativaobrera.coop: OK
(ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Notified CMAN that
'gandalf.san01.cooperativaobrera.coop' is now fenced
Mar 8 08:43:22 mordor crmd[26216]: notice: tengine_stonith_notify: Target may have been our
leader gandalf.san01.cooperativaobrera.coop (recorded: <unset>)
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.82 -> 0.513.83 from
lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than
required
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.83 -> 0.513.84 from
lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than
required
Mar 8 08:43:22 mordor cib[26211]: warning: cib_process_diff: Diff 0.513.84 -> 0.513.85 from
lorien.san01.cooperativaobrera.coop not applied to 0.513.85: current "num_updates" is greater than
required
Mar 8 08:43:22 mordor cib[26211]: notice: cib_process_diff: Diff 0.513.85 -> 0.513.86 from
lorien.san01.cooperativaobrera.coop not applied to 0.513.85: Failed application of an update diff
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts
for: last-failure-srv_mysql_dss (1384966716)
Mar 8 08:43:27 mordor crmd[26216]: notice: do_state_transition: State transition S_PENDING ->
S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts
for: probe_complete (true)
Mar 8 08:43:27 mordor attrd[26214]: notice: attrd_trigger_update: Sending flush op to all hosts
for: last-failure-vm_suc8300 (1384966659)
Mar 8 08:43:43 mordor stonith-ng[26212]: notice: remote_op_done: Operation reboot of
gandalf.san01.cooperativaobrera.coop by lorien.san01.cooperativaobrera.coop for
crmd.9934 at lorien.san01.cooperativaobrera.coop.3b50e760: OK
Mar 8 08:43:43 mordor crmd[26216]: notice: tengine_stonith_notify: Peer
gandalf.san01.cooperativaobrera.coop was terminated (st_notify_fence) by
lorien.san01.cooperativaobrera.coop for lorien.san01.cooperativaobrera.coop: OK
(ref=3b50e760-1142-4d3a-92b8-6936eb9df58c) by client crmd.9934
Mar 8 08:43:43 mordor crmd[26216]: notice: tengine_stonith_notify: Notified CMAN that
'gandalf.san01.cooperativaobrera.coop' is now fenced
- Up to this point, gandalf's resources were passed CORRECTLY to the other cluster members. We had the
cluster with 3 nodes online and 1 node offline, and everything was working OK.
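For what it's worth, this is roughly the kind of check we ran from one of the surviving nodes to
confirm that state (sketch; output omitted):

    crm_mon -1        # gandalf OFFLINE, all of its resources started on the other nodes
    cman_tool nodes   # cman membership: 3 members, gandalf marked as down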
- After gandalf's reboot completed we checked the node status and then started Pacemaker again manually,
and the node reacquired its resources CORRECTLY (the start commands are sketched after this log excerpt).
So again we had the cluster with all 4 nodes online and all resources running on their preferred nodes.
Log of the last resource transition back to gandalf:
Mar 8 08:51:40 lorien pengine[9933]: notice: LogActions: Move fs_mysql_hermes#011(Started
mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
Mar 8 08:51:40 lorien pengine[9933]: notice: LogActions: Move ip_mysql_hermes#011(Started
mordor.san01.cooperativaobrera.coop -> gandalf.san01.cooperativaobrera.coop)
Mar 8 08:51:40 lorien pengine[9933]: notice: process_pe_message: Calculated Transition 16:
/var/lib/pacemaker/pengine/pe-input-14.bz2
Mar 8 08:51:43 lorien crmd[9934]: notice: run_graph: Transition 16 (Complete=9, Pending=0,
Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-14.bz2): Complete
Mar 8 08:51:43 lorien crmd[9934]: notice: do_state_transition: State transition
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
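The manual start mentioned above was the standard CentOS 6 / cman procedure, roughly (sketch of what we
ran on gandalf after the reboot):

    service cman status         # make sure cman/corosync is up (start it first if not)
    service pacemaker start     # rejoin the cluster
    crm_mon -1                  # verify: 4 nodes online, resources back on their preferred nodes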
- So, we decided to delete the monitor operation from the resource that caused the timeout (an LVM
volume group) to avoid hitting the same problem again. We ran 'crm configure edit' and deleted two op
monitor lines, one from each volume group resource (shown in 'crm configure show' form after the log
excerpt below), and just after committing that modification to the CIB we experienced the resource
duplication:
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: --- 0.526.3
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: Diff: +++ 0.527.1 ebf608fdccda78e755a61cdeea18a659
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op
id="vg_ifx_oltp-monitor-240" interval="240" name="monitor" timeout="120" />
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: -- <op id="vg_ifx_dss-monitor-240"
interval="240" name="monitor" timeout="120" />
Mar 8 09:13:52 lorien cib[9929]: notice: cib:diff: ++ <cib epoch="527" num_updates="1"
admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.7"
update-origin="mordor.san01.cooperativaobrera.coop" update-client="crmd" cib-last-written="Sat Mar
8 08:58:40 2014" have-quorum="1" dc-uuid="lorien.san01.cooperativaobrera.coop" />
Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_IDLE ->
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Mar 8 09:13:52 lorien crmd[9934]: notice: do_state_transition: State transition S_ELECTION ->
S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Mar 8 09:13:52 lorien crmd[9934]: notice: do_dc_takeover: Marking
gandalf.san01.cooperativaobrera.coop, target of a previous stonith action, as clean
Mar 8 09:13:52 lorien crmd[9934]: warning: match_down_event: No match for shutdown action on
gandalf.san01.cooperativaobrera.coop
Mar 8 09:13:52 lorien crmd[9934]: notice: peer_update_callback: Stonith/shutdown of
gandalf.san01.cooperativaobrera.coop not matched
Mar 8 09:13:52 lorien crmd[9934]: notice: crm_update_peer_state: send_stonith_update: Node
gandalf.san01.cooperativaobrera.coop[1] - state is now lost
Mar 8 09:13:53 lorien crmd[9934]: warning: do_state_transition: 3 cluster nodes failed to respond
to the join offer.
Mar 8 09:13:53 lorien crmd[9934]: warning: do_log: FSA: Input I_ELECTION_DC from
do_dc_join_finalize() received in state S_FINALIZE_JOIN
Mar 8 09:13:53 lorien crmd[9934]: error: crmd_ais_dispatch: Recieving messages from a node we
think is dead: gandalf.san01.cooperativaobrera.coop[1]
Mar 8 09:13:53 lorien crmd[9934]: warning: match_down_event: No match for shutdown action on
gandalf.san01.cooperativaobrera.coop
Mar 8 09:13:53 lorien crmd[9934]: notice: peer_update_callback: Stonith/shutdown of
gandalf.san01.cooperativaobrera.coop not matched
Mar 8 09:13:53 lorien crmd[9934]: notice: do_election_count_vote: Election 6 (current: 6, owner:
lorien.san01.cooperativaobrera.coop): Processed no-vote from gandalf.san01.cooperativaobrera.coop
(Peer is not part of our cluster)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
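For clarity, in 'crm configure show' terms the edit removed exactly these two lines from their
respective primitives (agent shown as in the sketch above; surrounding parameters abbreviated):

    primitive vg_ifx_oltp ocf:heartbeat:LVM \
            ...
            op monitor interval="240" timeout="120"    # <- deleted
    primitive vg_ifx_dss ocf:heartbeat:LVM \
            ...
            op monitor interval="240" timeout="120"    # <- deleted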
- After that, node gandalf was in "pending" state but it was still running all of its preferred
resources. The other 3 nodes were online and also running their preferred resources, but the cluster
then instructed the other nodes to start gandalf's resources, without ordering a stonith of gandalf
first, so => duplication:
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts
for: last-failure-srv_mysql_hermes (1394279494)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts
for: last-failure-srv_mysql_oltp (1394279084)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts
for: probe_complete (true)
Mar 8 09:14:00 lorien attrd[9932]: notice: attrd_trigger_update: Sending flush op to all hosts
for: last-failure-srv_mysql_it (1384967023)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
st-lorien#011(isildur.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
fs_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
ip_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
srv_mysql_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
srv_jabberd#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
srv_conference#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
srv_orangutan#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
fs_jabber#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
ip_jabber#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
fs_bkoltp_ifx#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
ip_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
fs_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
srv_mysql_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
vg_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
srv_ifx_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
srv_xinetd_oltp#011(mordor.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: LogActions: Start
srv_xinetd_hermes#011(lorien.san01.cooperativaobrera.coop)
Mar 8 09:14:01 lorien pengine[9933]: notice: process_pe_message: Calculated Transition 53:
/var/lib/pacemaker/pengine/pe-input-50.bz2
- These last logs are the DC's orders to start gandalf's resources on the other members.
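For completeness, this is roughly how the duplication looked from the shell on the two nodes involved
for each instance -- gandalf plus whichever node the DC started the second copy on (sketch; exact
process names may differ):

    vgs vg_ifx_oltp                  # the same VG reported active on both nodes
    ps -C oninit -o pid,args         # an Informix instance running on both nodes
    ps -C mysqld -o pid,args         # likewise for the MySQL instances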
- Our databases ran as duplicated instances for 7 minutes, but that was enough to cause some pretty
severe data corruption, and we had to restore them from a full backup.
Can you identify a problem in this sequence of events? Did we do something wrong? Is this a known bug?
I'm thinking of something related to the first stonith kill (gandalf was DC at that time) and some
inconsistent state after that. There is also an election (?) right at the moment the CIB modification
was pushed to the other members. Could this be related?
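If it helps, I can generate and share a full report covering that window, roughly:

    crm_report -f "2014-03-08 08:30:00" -t "2014-03-08 09:30:00" pcmk-duplication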
Many thanks in advance for any insight you can offer.
Kind regards!
--
Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
ggomiz at cooperativaobrera.coop
Gerencia de Sistemas - Cooperativa Obrera Ltda.
Tel: (0291) 403-9700