[ClusterLabs] Pacemaker crash and fencing failure

Brian Campbell brian.campbell at editshare.com
Fri Nov 20 19:38:08 EST 2015


I've been trying to debug and do a root-cause analysis of a cascading
series of failures that a customer hit a couple of days ago, which
left their filesystem unavailable for a couple of hours.

The original failure was in our own distributed filesystem backend, a
fork of LizardFS, which is in turn a fork of MooseFS. This history is
mostly only important for reading the logs, where "efs", "lizardfs",
and "mfs" all generally refer to the same services; they are just
different generations of naming, as not all daemons, scripts, and
packages have been renamed.

There are two master servers that handle metadata operations, running
Pacemaker to elect which one is the current primary and which one is a
replica, and a number of chunkservers that store file chunks and
simply connect to the current running master via a virtual IP. A bug
in the chunkservers' checksum scan caused them to leak file
descriptors and become unresponsive, so while the master server was up
and healthy, no actual filesystem operations could occur. (This bug is
now fixed, by the way, and the fix deployed to the customer, but we
still want to understand why the later failures occurred that caused
them to continue to have downtime.)

The customer saw that things were unresponsive and tried the
simplest thing they could to resolve it: migrating the services
to the other master. This succeeded, as the checksum scan had been
initiated by the first master and so switching over to the replica
caused all of the extra file descriptors to be closed and the
chunkservers to become responsive again.
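
I don't know exactly which interface they used (most likely our
management UI), but in crm shell terms, and with the simplified
resource names from the sketch above, the failover amounted to
something like:

    # Move the stack to the other master:
    crm resource migrate lizardfs-master-ms es-efs-master2
    # Later, drop the location constraint that migrate leaves behind:
    crm resource unmigrate lizardfs-master-ms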

However, due to one backup service that is not yet managed via
Pacemaker and thus is only running on the first master, they decided
to migrate back to the first master. This was when they ran into a
Pacemaker problem.

At the time of the problem, es-efs-master1 is the server that was
originally the master when the first problem happened, and the one they
are trying to migrate the services back to. es-efs-master2 is the one
actively running the services, and it also happens to be the DC at the
time, so that's where to look for the pengine messages.

On master2, you can see the point when the user tried to migrate back
to master1 based on the pengine decisions:

(By the way, apologies for the long message with large log excerpts; I
was trying to balance enough detail with not overwhelming the list. It
can be hard to keep things short when explaining these kinds of
complicated failures across a number of machines.)

Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
-> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Demote  editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
-> Slave es-efs-master2)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice:
process_pe_message: Calculated Transition 1481601:
/var/lib/pacemaker/pengine/pe-input-1355.bz2
Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]:  warning:
cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not
applied to 0.2754083.1: Failed application of an update diff
Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
(call=400, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice: run_graph:
Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15,
Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-1355.bz2):
Stopped
Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op demote for
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
on es-efs-master2: unknown error (1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
-> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
Demote  editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
-> Slave es-efs-master2)
Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice:
process_pe_message: Calculated Transition 1481602:
/var/lib/pacemaker/pengine/pe-input-1356.bz2
Nov 18 08:28:28 es-efs-master2 crmd[1924]:  warning: destroy_action:
Cancelling timer for action 14 (src=4445462)
Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
(call=403, rc=0, cib-update=0, confirmed=true) ok


A little later on, we see that we have successfully demoted our
lizardfs-master service:

Nov 18 08:28:31 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
(call=406, rc=0, cib-update=2963647, confirmed=true) ok

The logs go on for a few seconds, as we are migrating a number of
other services as well (a variety of management services, LDAP, and so
on). If we switch back to our master1 logs, a few seconds later we
see some failures of crmd talking to lrmd:

Nov 18 08:28:39 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
Request 393 to lrmd (0x7f51b85f8e30) failed: Resource temporarily
unavailable (-11)
Nov 18 08:28:39 es-efs-master1 crmd[1826]:    error:
lrmd_send_command: Couldn't perform lrmd_rsc_info operation
(timeout=0): -11: Connection timed out (110)
Nov 18 08:28:40 es-efs-master1 mfsmaster[20053]: connecting to Master
Nov 18 08:28:40 es-efs-master1 mfsmaster[20053]: connected to Master
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
recieved
Nov 18 08:28:40 es-efs-master1 crmd[1826]:    error:
lrmd_send_command: Couldn't perform lrmd_rsc_register operation
(timeout=0): -114: Connection timed out (110)
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
recieved
Nov 18 08:28:40 es-efs-master1 crmd[1826]:    error:
lrmd_send_command: Couldn't perform lrmd_rsc_info operation
(timeout=0): -114: Connection timed out (110)
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
bad input   <create_request_adv origin="te_rsc_command" t="crmd"
version="3.0.7" subt="request"
reference="lrm_invoke-tengine-1447853313-1481932"
crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine"
crm_host_to="es-efs-master1" src="es-efs-master2">
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
bad input     <crm_xml>
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
bad input       <rsc_op id="87" operation="start"
operation_key="editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd_start_0"
on_node="es-efs-master1" on_node_uuid="1"
transition-key="87:1481605:0:df645faa-fb80-4aa4-89dd-efa8090f9807">
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
bad input         <primitive
id="editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd"
class="upstart" type="editshare-dbd"/>
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
bad input         <attributes CRM_meta_timeout="20000"
crm_feature_set="3.0.7"/>
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
bad input       </rsc_op>
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
bad input     </crm_xml>
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
bad input   </create_request_adv>
Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
recieved

A few more seconds later, and it looks like crmd gives up and dies, at
which point pacemakerd notices and restarts it:

Nov 18 08:28:43 es-efs-master1 crmd[1826]:  warning: do_log: FSA:
Input I_FAIL from get_lrm_resource() received in state S_NOT_DC
Nov 18 08:28:43 es-efs-master1 crmd[1826]:   notice:
do_state_transition: State transition S_NOT_DC -> S_RECOVERY [
input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
Nov 18 08:28:43 es-efs-master1 crmd[1826]:  warning: do_log: FSA:
Input I_FAIL from get_lrm_resource() received in state S_RECOVERY
Nov 18 08:28:43  crmd[1826]: last message repeated 2 times
Nov 18 08:28:43 es-efs-master1 crmd[1826]:    error: do_log: FSA:
Input I_TERMINATE from do_recover() received in state S_RECOVERY
Nov 18 08:28:44 es-efs-master1 mfsmaster[20053]: connecting to Master
Nov 18 08:28:44 es-efs-master1 mfsmaster[20053]: connected to Master
Nov 18 08:28:44 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
recieved
Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
lrmd_send_command: Couldn't perform lrmd_rsc_cancel operation
(timeout=0): -114: Connection timed out (110)
Nov 18 08:28:44 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
recieved
Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
lrmd_send_command: Couldn't perform lrmd_rsc_cancel operation
(timeout=0): -114: Connection timed out (110)
Nov 18 08:28:44 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
recieved
Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
lrmd_send_command: Couldn't perform lrmd_rsc_cancel operation
(timeout=0): -114: Connection timed out (110)
Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
lrm_state_verify_stopped: 2 pending LRM operations at shutdown(null)
Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
lrm_state_verify_stopped: Pending action:
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server:287
(editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server_start_0)
Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
lrm_state_verify_stopped: Pending action:
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver:285
(editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver_start_0)
Nov 18 08:28:44 es-efs-master1 crmd[1826]:   notice:
terminate_cs_connection: Disconnecting from Corosync
Nov 18 08:28:44 es-efs-master1 cib[1821]:  warning:
qb_ipcs_event_sendv: new_event_notification (1821-1826-10): Broken
pipe (32)
Nov 18 08:28:44 es-efs-master1 pacemakerd[1760]:    error:
pcmk_child_exit: Child process crmd (1826) exited: Generic Pacemaker
error (201)
Nov 18 08:28:44 es-efs-master1 crmd[7128]:   notice:
crm_cluster_connect: Connecting to cluster infrastructure: corosync
Nov 18 08:28:44 es-efs-master1 crmd[7128]:   notice:
corosync_node_name: Unable to get node name for nodeid 1
Nov 18 08:28:44 es-efs-master1 crmd[7128]:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Nov 18 08:28:44 es-efs-master1 crmd[7128]:   notice:
cluster_connect_quorum: Quorum acquired

Back over on master2, this causes it to decide to initiate a fencing action:

Nov 18 08:28:44 es-efs-master2 crmd[1924]:   notice: run_graph:
Transition 1481605 (Complete=6, Pending=0, Fired=0, Skipped=38,
Incomplete=12, Source=/var/lib/pacemaker/pengine/pe-input-1359.bz2): Stopped
Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning: pe_fence_node:
Node es-efs-master1 will be fenced because our peer process is no
longer available
Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning:
determine_online_status: Node es-efs-master1 is unclean
Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op demote for
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
on es-efs-master2: unknown error (1)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
to stop after a failed demote action
Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning: stage6:
Scheduling Node es-efs-master1 for STONITH
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
es-efs-master1 -> es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Master
es-efs-master1 -> es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
-> Master es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:0#011(es-efs-master1)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Stop    editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:0#011(es-efs-master1)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(es-efs-master2)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
Stop    fence-es-efs-master2-ipmi#011(es-efs-master1)
Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning:
process_pe_message: Calculated Transition 1481606:
/var/lib/pacemaker/pengine/pe-warn-374.bz2
Nov 18 08:28:44 es-efs-master2 crmd[1924]:   notice: run_graph:
Transition 1481606 (Complete=0, Pending=0, Fired=0, Skipped=43,
Incomplete=20, Source=/var/lib/pacemaker/pengine/pe-warn-374.bz2):
Stopped

However, at this point, nothing happens. There are no further logs
about fencing, stonith, or IPMI for another few minutes, other than
a couple of occasional warnings like the following:

Nov 18 08:28:44 es-efs-master2 stonith-ng[1920]:  warning:
cib_process_diff: Diff 0.2754086.12 -> 0.2754086.13 from local not
applied to 0.2754086.12: Failed application of an update diff

Nov 18 08:34:58 es-efs-master2 stonith-ng[1920]:  warning:
cib_process_diff: Diff 0.2754086.20 -> 0.2754086.21 from local not
applied to 0.2754086.20: Failed application of an update diff

Nov 18 08:34:59 es-efs-master2 stonith-ng[1920]:  warning:
cib_process_diff: Diff 0.2754086.25 -> 0.2754086.26 from local not
applied to 0.2754086.25: Failed application of an update diff
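
In case it points anyone at the right place to look: the next time a
fence appears to hang like this, I plan to interrogate stonithd
directly, along these lines (I'm assuming these stonith_admin options
behave the same in our 1.1.10 build; the device and node names are
from our cluster):

    stonith_admin --list-registered         # devices stonithd has registered
    stonith_admin --history es-efs-master1  # fencing attempts recorded for the node
    stonith_admin --query fence-es-efs-master1-ipmi  # check the device's status
    crm_mon -1                              # one-shot cluster status, pending actions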

We then see pengine react as if master1 is back online again, even
though it was never fenced:

Nov 18 08:35:00 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op demote for
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0
on es-efs-master2: unknown error (1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0
to stop after a failed demote action
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(es-efs-master2)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:1#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Demote  editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:0#011(Master -> Slave es-efs-master2)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:1#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:1#011(Stopped -> Master es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   fence-es-efs-master2-ipmi#011(es-efs-master1)
Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice:
process_pe_message: Calculated Transition 1481607:
/var/lib/pacemaker/pengine/pe-input-1360.bz2

Nov 18 08:35:01 es-efs-master2 crmd[1924]:   notice: run_graph:
Transition 1481607 (Complete=26, Pending=0, Fired=0, Skipped=35,
Incomplete=18, Source=/var/lib/pacemaker/pengine/pe-input-1360.bz2):
Stopped
Nov 18 08:35:01 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op demote for
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
on es-efs-master2: unknown error (1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
to stop after a failed demote action
Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
-> Master es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:0#011(Slave
-> Master es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(es-efs-master1)
Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice:
process_pe_message: Calculated Transition 1481608:
/var/lib/pacemaker/pengine/pe-input-1361.bz2
Nov 18 08:35:01 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
(call=463, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:35:01 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive_notify_0
(call=465, rc=0, cib-update=0, confirmed=true) ok

And it just keeps going like this, seemingly stuck because everything
on master1 is returning unknown errors:

Nov 18 08:36:21 es-efs-master2 crmd[1924]:   notice: run_graph:
Transition 1481608 (Complete=12, Pending=0, Fired=0, Skipped=19,
Incomplete=13, Source=/var/lib/pacemaker/pengine/pe-input-1361.bz2): Stopped
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op monitor for
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip on
es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op monitor for
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:0
on es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op monitor for
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver
on es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op start for
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd on
es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op start for
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web on
es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op monitor for
fence-es-efs-master2-ipmi on es-efs-master1: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Processing failed op demote for
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
on es-efs-master2: unknown error (1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
to stop after a failed demote action
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning:
common_apply_stickiness: Forcing
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd
away from es-efs-master1 after 1000000 failures (max=1000000)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning:
common_apply_stickiness: Forcing
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web
away from es-efs-master1 after 1000000 failures (max=1000000)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Recover editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
-> Master es-efs-master1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Recover editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:0#011(Started
es-efs-master1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Promote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:1#011(Slave
-> Master es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Recover editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver#011(Started
es-efs-master1 -> es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Move    editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server#011(Started
es-efs-master1 -> es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Recover editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(Started
es-efs-master1 -> es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Recover editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(Started
es-efs-master1 -> es-efs-master2)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
Recover fence-es-efs-master2-ipmi#011(Started es-efs-master1)
Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice:
process_pe_message: Calculated Transition 1481609:
/var/lib/pacemaker/pengine/pe-input-1362.bz2
Nov 18 08:36:21 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
(call=469, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:36:21 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive_notify_0
(call=471, rc=0, cib-update=0, confirmed=true) ok

Not a whole lot happens after that for the next 40 minutes or so,
until someone manually reboots master1. At this point master2 sees it
disappear again, decides to fence it again, and this time is
successful:

Nov 18 09:11:42 es-efs-master2 corosync[1842]:   [TOTEM ] A new
membership (172.17.0.6:136) was formed.
Members left: 1
Nov 18 09:11:42 es-efs-master2 pacemakerd[1847]:   notice:
crm_update_peer_state: pcmk_quorum_notification: Node
es-efs-master1[1] - state is now lost (was member)
Nov 18 09:11:42 es-efs-master2 crmd[1924]:   notice:
crm_update_peer_state: pcmk_quorum_notification: Node
es-efs-master1[1] - state is now lost (was member)
Nov 18 09:11:42 es-efs-master2 mfsmount[946]: can't connect to master
("172.17.0.16":"9421")
Nov 18 09:11:42 es-efs-master2 mfsmount[946]: can't connect to master
("172.17.0.16":"9421")
Nov 18 09:11:43 es-efs-master2 mfsmaster[22800]: connecting to Master
Nov 18 09:11:43 es-efs-master2 mfsmaster[22800]: connection failed,
error: ECONNREFUSED (Connection refused)
Nov 18 09:11:43 es-efs-master2 stonith-ng[1920]:  warning:
cib_process_diff: Diff 0.2754095.3 -> 0.2754095.4 from local not
applied to 0.2754095.3: Failed application of an update diff
Nov 18 09:11:44 es-efs-master2 mfsmaster[22800]: connecting to Master
Nov 18 09:11:44 es-efs-master2 mfsmaster[22800]: connection failed,
error: ECONNREFUSED (Connection refused)
Nov 18 09:11:44 es-efs-master2 pengine[1923]:  warning: pe_fence_node:
Node es-efs-master1 will be fenced because the node is no longer part
of the cluster
Nov 18 09:11:44 es-efs-master2 pengine[1923]:  warning:
determine_online_status: Node es-efs-master1 is unclean

Nov 18 09:12:47 es-efs-master2 mfsmaster[22800]: connecting to Master
Nov 18 09:12:47 es-efs-master2 stonith-ng[1920]:   notice:
log_operation: Operation 'reboot' [27561] (call 3 from crmd.1924) for
host 'es-efs-master1' with device 'fence-es-efs-master1-ipmi'
returned: 0 (OK)
Nov 18 09:12:47 es-efs-master2 stonith-ng[1920]:   notice:
remote_op_done: Operation reboot of es-efs-master1 by es-efs-master2
for crmd.1924 at es-efs-master2.276145eb: OK
Nov 18 09:12:47 es-efs-master2 stonith-ng[1920]:  warning:
cib_process_diff: Diff 0.2754095.7 -> 0.2754095.8 from local not
applied to 0.2754095.7: Failed application of an update diff

Now we were able to start making progress again, restarting and
re-promoting resources on master2, and the system got back into a
running state:

Nov 18 09:12:47 es-efs-master2 crmd[1924]:   notice:
process_lrm_event: LRM operation
editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver_start_0
(call=485, rc=0, cib-update=2963781, confirmed=true) ok
Nov 18 09:12:47 es-efs-master2
IPaddr2(editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip)[28258]:
INFO: Adding inet address 172.17.0.16/22 to device bond0
Nov 18 09:12:47 es-efs-master2 avahi-daemon[1255]: Registering new
address record for 172.17.0.16 on bond0.IPv4.
Nov 18 09:12:47 es-efs-master2
IPaddr2(editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip)[28258]:
INFO: Bringing device bond0 up
Nov 18 09:12:47 es-efs-master2
IPaddr2(editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip)[28258]:
INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p
/var/run/resource-agents/send_arp-172.17.0.16 bond0 172.17.0.16 auto
not_used not_used

After this we had some further problems that caused some more
downtime, but I don't believe they were related to Pacemaker.

What I'm concerned about is the initial crash of crmd on master1
that led to master2 deciding to fence it, and then master2's failure
to actually fence master1, which left it stuck and unable to manage
resources. It seems to have simply stopped doing anything, with no
logs indicating why.

We are running Pacemaker 1.1.10+git20130802-1ubuntu1 and Corosync
2.3.3-1ubuntu1 on a custom distro based on Ubuntu 12.04 LTS, with
these packages backported from Ubuntu 14.04 and a couple of local
patches applied to start the daemons via Upstart rather than
sysvinit, due to some bugs in the sysvinit scripts.

A couple of questions:

1. Is the original crash a known problem? Worth submitting a bug
about? Or is it already fixed in a newer release of Pacemaker, and I
should just upgrade?
2. Is there any reason why fencing would just hang like this, with
nothing logged? Is there anywhere I'm not looking that would have more
information? I tried crm_simulate on the pe-input and pe-warn files
mentioned at various points (invocations sketched after this list), and
they contained pretty much what I would expect, such as planning to
fence with other actions depending on the fencing action. One thing I
notice is that there are actions such as
"editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_pre_notify_stop_0
es-efs-master1" and
"editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive_pre_notify_stop_0
es-efs-master1" that don't have dependencies on the stonith action;
I'm wondering whether any of these actions hanging (such as not being
able to perform them on master1) could have caused the stonith action
to not proceed. Or, another possibility: could the "Failed application
of an update diff" messages indicate that the stonith daemon wasn't
able to save some essential state to the CIB, and thus the stonith
action got lost somehow?
3. Is there anyone who provides anything like commercial developer
support for Pacemaker and Corosync? This is a fairly complex issue to
be asking for free support on the mailing list; if this winds up being
too complicated for the venue or if I need to provide full logs that
may contain customer data, I'm wondering if there's anywhere to
escalate it. Since we roll our own custom distro and build our own
packages, we can't really have a traditional vendor contract with
RedHat or SuSE, so I'm wondering if there's anyone who offers direct
developer support and consulting, like SerNet does for Samba and
Netatalk.
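
For reference, the crm_simulate runs mentioned in question 2 were
along these lines, against the saved policy-engine files:

    # Replay the transition from a saved PE input:
    crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-warn-374.bz2
    # Short form, also showing allocation scores:
    crm_simulate -S -s -x /var/lib/pacemaker/pengine/pe-input-1360.bz2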

Let me know if anything else is required to answer this. I've grabbed
all of the /var/lib/pacemaker/pengine files and syslogs from the time
of the incident, and can get access back to the system to collect
other data, so if there's anything that would be helpful to debug I
can probably grab it.

If you've gotten all the way down here, thanks for reading!

-- Brian



