[ClusterLabs] Pacemaker crash and fencing failure

Ken Gaillot kgaillot at redhat.com
Mon Nov 30 16:53:28 EST 2015


On 11/20/2015 06:38 PM, Brian Campbell wrote:
> I've been trying to debug and do a root cause analysis for a cascading
> series of failures that a customer hit a couple of days ago, that
> caused their filesystem to be unavailable for a couple of hours.
> 
> The original failure was in our own distributed filesystem backend, a
> fork of LizardFS, which is in turn a fork of MooseFS. This history is
> mostly only important in reading the logs, where "efs", "lizardfs",
> and "mfs" all generally refer to the same services, just different
> generations of naming, as not all daemons, scripts, and packages
> have been renamed.
> 
> There are two master servers that handle metadata operations, running
> Pacemaker to elect which one is the current primary and which one is a
> replica, and a number of chunkservers that store file chunks and
> simply connect to the current running master via a virtual IP. A bug
> in doing a checksum scan on the chunkservers caused them to leak file
> descriptors and become unresponsive, so while the master server was up
> and healthy, no actual filesystem operations could occur. (This bug is
> now fixed by the way, and the fix deployed to the customer, but we
> want to debug why the later failures occurred that caused them to
> continue to have downtime).
> 
> The customer saw that things were unresponsive, and tried doing the
> simplest thing they could to try to resolve it, migrate the services
> to the other master. This succeeded, as the checksum scan had been
> initiated by the first master and so switching over to the replica
> caused all of the extra file descriptors to be closed and the
> chunkservers to become responsive again.
> 
> However, due to one backup service that is not yet managed via
> Pacemaker and thus is only running on the first master, they decided
> to migrate back to the first master. This was when they ran into a
> Pacemaker problem.
> 
> At the time of the problem, es-efs-master1 is the server that was
> originally the master when the first problem happened, and which they
> are trying to migrate the services back to. es-efs-master2 is the one
> actively running the services, and also happens to be the DC at the
> time, so that's where to look for pengine messages.
> 
> On master2, you can see the point when the user tried to migrate back
> to master1 based on the pengine decisions:
> 
> (by the way, apologies for the long message with large log excerpts; I
> was trying to balance enough detail with not overwhelming, it can be
> hard to keep it short when explaining these kinds of complicated
> failures across a number of machines)

Indeed. In a case as complex as this, more is better -- a copy of the
configuration and the output of crm_report would be helpful (with any
sensitive info removed).
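
If it helps, a crm_report invocation along these lines should capture
the incident window (timestamps taken from your logs; the last argument
is just a name for the output archive):

    crm_report --from "2015-11-18 08:00:00" --to "2015-11-18 10:00:00" \
        es-efs-incident

Run it from one node that can ssh to the other, and it will collect the
logs, CIB, and pe-input files from both.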

> Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Forcing editshare.stack.7c645b0e-
> 46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after
> a failed demote action

Any idea why the demote failed?

> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
> es-efs-master2 -> es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
> -> Master es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Demote  editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
> -> Slave es-efs-master2)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice:
> process_pe_message: Calculated Transition 1481601:
> /var/lib/pacemaker/pengine/pe-input-1355.bz2
> Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]:  warning:
> cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not
> applied to 0.2754083.1: Failed application of an update diff

Normally failed diffs shouldn't cause any problems, other than extra
network/disk traffic for a full CIB sync.

There have been many bugs fixed since 1.1.10, and I think some of them
relate to CIB diffs. Since you're doing everything custom, you might as
well use the latest upstream libqb/corosync/pacemaker releases (btw the current
pacemaker master branch has been stable and will soon become the basis
of 1.1.14rc1).

Have you been able to reproduce the issue on a test cluster? That would
be important for investigating further. You could set PCMK_debug=yes (I'm
guessing that would be in /etc/default/pacemaker on Ubuntu) to get more
logs.
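
For example, something like this in that file (a sketch; restart
pacemaker afterwards for it to take effect):

    # /etc/default/pacemaker
    PCMK_debug=yes
    # or, to limit the noise, name only the daemons of interest:
    # PCMK_debug=crmd,lrmd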

> Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
> (call=400, rc=0, cib-update=0, confirmed=true) ok
> Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice: run_graph:
> Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15,
> Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-1355.bz2):
> Stopped
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op demote for
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> on es-efs-master2: unknown error (1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> to stop after a failed demote action
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
> es-efs-master2 -> es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
> -> Master es-efs-master1)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice: LogActions:
> Demote  editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master
> -> Slave es-efs-master2)
> Nov 18 08:28:28 es-efs-master2 pengine[1923]:   notice:
> process_pe_message: Calculated Transition 1481602:
> /var/lib/pacemaker/pengine/pe-input-1356.bz2
> Nov 18 08:28:28 es-efs-master2 crmd[1924]:  warning: destroy_action:
> Cancelling timer for action 14 (src=4445462)
> Nov 18 08:28:28 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
> (call=403, rc=0, cib-update=0, confirmed=true) ok
> 
> 
> A little later on, we see that we have successfully demoted our
> lizardfs-master service:
> 
> Nov 18 08:28:31 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_demote_0
> (call=406, rc=0, cib-update=2963647, confirmed=true) ok
> 
> The logs go on for a few seconds, as we are migrating a number of
> other services as well (a variety of management services, LDAP, and so
> on). If we switch back to our master1 logs, a few seconds later we
> see some failures of crmd talking to lrmd:
> 
> Nov 18 08:28:39 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
> Request 393 to lrmd (0x7f51b85f8e30) failed: Resource temporarily
> unavailable (-11)

This is bad; it seems like lrmd stops responding entirely. Are there any
interesting lrmd logs before this?
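
Something as simple as the following should pull the relevant window,
assuming everything lands in syslog on your Ubuntu-based distro:

    grep -E 'Nov 18 08:2[78]' /var/log/syslog | grep -E 'lrmd|crmd'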

Again, if you can reproduce the issue with your existing versions, I'd
recommend testing whether it still occurs with a newer version.

> Nov 18 08:28:39 es-efs-master1 crmd[1826]:    error:
> lrmd_send_command: Couldn't perform lrmd_rsc_info operation
> (timeout=0): -11: Connection timed out (110)
> Nov 18 08:28:40 es-efs-master1 mfsmaster[20053]: connecting to Master
> Nov 18 08:28:40 es-efs-master1 mfsmaster[20053]: connected to Master
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
> Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
> recieved
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:    error:
> lrmd_send_command: Couldn't perform lrmd_rsc_register operation
> (timeout=0): -114: Connection timed out (110)
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
> Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
> recieved
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:    error:
> lrmd_send_command: Couldn't perform lrmd_rsc_info operation
> (timeout=0): -114: Connection timed out (110)
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
> bad input   <create_request_adv origin="te_rsc_command" t="crmd"
> version="3.0.7" subt="request"
> reference="lrm_invoke-tengine-1447853313-1481932"
> crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine"
> crm_host_to="es-efs-master1" src="es-efs-master2">
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
> bad input     <crm_xml>
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
> bad input       <rsc_op id="87" operation="start"
> operation_key="editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd_start_0"
> on_node="es-efs-master1" on_node_uuid="1"
> transition-key="87:1481605:0:df645faa-fb80-4aa4-89dd-efa8090f9807">
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
> bad input         <primitive
> id="editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd"
> class="upstart" type="editshare-dbd"/>
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
> bad input         <attributes CRM_meta_timeout="20000"
> crm_feature_set="3.0.7"/>
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
> bad input       </rsc_op>
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
> bad input     </crm_xml>
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: do_lrm_invoke:
> bad input   </create_request_adv>
> Nov 18 08:28:40 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
> Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
> recieved
> 
> A few more seconds later, and it looks like crmd gives up and dies, at
> which point pacemakerd notices and restarts it:
> 
> Nov 18 08:28:43 es-efs-master1 crmd[1826]:  warning: do_log: FSA:
> Input I_FAIL from get_lrm_resource() received in state S_NOT_DC
> Nov 18 08:28:43 es-efs-master1 crmd[1826]:   notice:
> do_state_transition: State transition S_NOT_DC -> S_RECOVERY [
> input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
> Nov 18 08:28:43 es-efs-master1 crmd[1826]:  warning: do_log: FSA:
> Input I_FAIL from get_lrm_resource() received in state S_RECOVERY
> Nov 18 08:28:43  crmd[1826]: last message repeated 2 times
> Nov 18 08:28:43 es-efs-master1 crmd[1826]:    error: do_log: FSA:
> Input I_TERMINATE from do_recover() received in state S_RECOVERY
> Nov 18 08:28:44 es-efs-master1 mfsmaster[20053]: connecting to Master
> Nov 18 08:28:44 es-efs-master1 mfsmaster[20053]: connected to Master
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
> Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
> recieved
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
> lrmd_send_command: Couldn't perform lrmd_rsc_cancel operation
> (timeout=0): -114: Connection timed out (110)
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
> Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
> recieved
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
> lrmd_send_command: Couldn't perform lrmd_rsc_cancel operation
> (timeout=0): -114: Connection timed out (110)
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:  warning: crm_ipc_send:
> Sending to lrmd (0x7f51b85f8e30) is disabled until pending reply is
> recieved
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
> lrmd_send_command: Couldn't perform lrmd_rsc_cancel operation
> (timeout=0): -114: Connection timed out (110)
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
> lrm_state_verify_stopped: 2 pending LRM operations at shutdown(null)
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
> lrm_state_verify_stopped: Pending action:
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server:287
> (editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server_start_0)
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:    error:
> lrm_state_verify_stopped: Pending action:
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver:285
> (editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver_start_0)
> Nov 18 08:28:44 es-efs-master1 crmd[1826]:   notice:
> terminate_cs_connection: Disconnecting from Corosync
> Nov 18 08:28:44 es-efs-master1 cib[1821]:  warning:
> qb_ipcs_event_sendv: new_event_notification (1821-1826-10): Broken
> pipe (32)
> Nov 18 08:28:44 es-efs-master1 pacemakerd[1760]:    error:
> pcmk_child_exit: Child process crmd (1826) exited: Generic Pacemaker
> error (201)
> Nov 18 08:28:44 es-efs-master1 crmd[7128]:   notice:
> crm_cluster_connect: Connecting to cluster infrastructure: corosync
> Nov 18 08:28:44 es-efs-master1 crmd[7128]:   notice:
> corosync_node_name: Unable to get node name for nodeid 1
> Nov 18 08:28:44 es-efs-master1 crmd[7128]:   notice: get_node_name:
> Defaulting to uname -n for the local corosync node name
> Nov 18 08:28:44 es-efs-master1 crmd[7128]:   notice:
> cluster_connect_quorum: Quorum acquired
> 
> Back over on master2, this causes it to decide to initiate a fencing action:
> 
> Nov 18 08:28:44 es-efs-master2 crmd[1924]:   notice: run_graph:
> Transition 1481605 (Complete=6, Pending=0, Fired=0, Skipped=38,
> Incomplete=12, Source=/var/lib/pacemaker/pengine/pe-input-1359.bz2): Stopped
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning: pe_fence_node:
> Node es-efs-master1 will be fenced because our peer process is no
> longer available
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning:
> determine_online_status: Node es-efs-master1 is unclean
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op demote for
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> on es-efs-master2: unknown error (1)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> to stop after a failed demote action
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning: stage6:
> Scheduling Node es-efs-master1 for STONITH
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
> es-efs-master1 -> es-efs-master2)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Move    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Master
> es-efs-master1 -> es-efs-master2)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
> -> Master es-efs-master2)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:0#011(es-efs-master1)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master2)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Stop    editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:0#011(es-efs-master1)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver#011(es-efs-master2)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server#011(es-efs-master2)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(es-efs-master2)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(es-efs-master2)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:   notice: LogActions:
> Stop    fence-es-efs-master2-ipmi#011(es-efs-master1)
> Nov 18 08:28:44 es-efs-master2 pengine[1923]:  warning:
> process_pe_message: Calculated Transition 1481606:
> /var/lib/pacemaker/pengine/pe-warn-374.bz2
> Nov 18 08:28:44 es-efs-master2 crmd[1924]:   notice: run_graph:
> Transition 1481606 (Complete=0, Pending=0, Fired=0, Skipped=43,
> Incomplete=20, Source=/var/lib/pacemaker/pengine/pe-warn-374.bz2):
> Stopped
> 
> However, at this point, nothing happens. There are no further logs
> about fencing, stonith, or ipmi for another few minutes, other than a
> couple of occasional warnings like the following:

This doesn't make sense. You should see a crmd log message about
"Executing ... fencing operation" and then stonith-ng logs about the
actual execution. I can't think of any reason that wouldn't happen, even
if fencing failed. I'd have to see more logs, but again, it would be
worth trying to reproduce with newer versions.
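
It would also be worth exercising the fencing path by hand on the test
cluster, bypassing the policy engine entirely -- something like this,
with the device arguments being whatever your fence-es-efs-master1-ipmi
resource is configured with:

    # ask stonith-ng directly to reboot the node
    stonith_admin --reboot es-efs-master1

    # or test the fence agent itself, outside pacemaker
    fence_ipmilan -a <master1-ipmi-address> -l <user> -p <password> -o status

If those succeed while cluster-initiated fencing doesn't, that points at
the crmd/stonith-ng handoff rather than the device.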

> Nov 18 08:28:44 es-efs-master2 stonith-ng[1920]:  warning:
> cib_process_diff: Diff 0.2754086.12 -> 0.2754086.13 from local not
> applied to 0.2754086.12: Failed application of an update diff
> 
> Nov 18 08:34:58 es-efs-master2 stonith-ng[1920]:  warning:
> cib_process_diff: Diff 0.2754086.20 -> 0.2754086.21 from local not
> applied to 0.2754086.20: Failed application of an update diff
> 
> Nov 18 08:34:59 es-efs-master2 stonith-ng[1920]:  warning:
> cib_process_diff: Diff 0.2754086.25 -> 0.2754086.26 from local not
> applied to 0.2754086.25: Failed application of an update diff
> 
> We then see pengine react as if master1 is back online again, even
> though it was never fenced:
> 
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op demote for
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0
> on es-efs-master2: unknown error (1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0
> to stop after a failed demote action
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(es-efs-master1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(es-efs-master2)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:1#011(es-efs-master1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Demote  editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:0#011(Master
> -> Slave es-efs-master2)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:1#011(es-efs-master1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:1#011(Stopped
> -> Master es-efs-master1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver#011(es-efs-master1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(es-efs-master1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(es-efs-master1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   fence-es-efs-master2-ipmi#011(es-efs-master1)
> Nov 18 08:35:00 es-efs-master2 pengine[1923]:   notice:
> process_pe_message: Calculated Transition 1481607:
> /var/lib/pacemaker/pengine/pe-input-1360.bz2
> 
> Nov 18 08:35:01 es-efs-master2 crmd[1924]:   notice: run_graph:
> Transition 1481607 (Complete=26, Pending=0, Fired=0, Skipped=35,
> Incomplete=18, Source=/var/lib/pacemaker/pengine/pe-input-1360.bz2):
> Stopped
> Nov 18 08:35:01 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op demote for
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> on es-efs-master2: unknown error (1)
> Nov 18 08:35:01 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> to stop after a failed demote action
> Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
> -> Master es-efs-master1)
> Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
> Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
> Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master1)
> Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:0#011(Slave
> -> Master es-efs-master1)
> Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(es-efs-master1)
> Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(es-efs-master1)
> Nov 18 08:35:01 es-efs-master2 pengine[1923]:   notice:
> process_pe_message: Calculated Transition 1481608:
> /var/lib/pacemaker/pengine/pe-input-1361.bz2
> Nov 18 08:35:01 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
> (call=463, rc=0, cib-update=0, confirmed=true) ok
> Nov 18 08:35:01 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive_notify_0
> (call=465, rc=0, cib-update=0, confirmed=true) ok
> 
> And it just keeps on going like this, seemingly getting stuck because
> everything on master1 is giving unknown errors:

This has no obvious cause either. The interesting parts would be earlier
when crmd asked lrmd to execute the actions. You could check the
corresponding master1 logs to see whether it's getting the lrmd requests
successfully and what it tries to do with them if so.
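
For example, for the editshare-dbd start that shows up as "bad input"
above, something like this on master1 would show whether the request
ever reached crmd and what lrmd did with it:

    grep 'editshare-dbd_start_0' /var/log/syslog
    grep 'transition-key="87:1481605' /var/log/syslog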

> Nov 18 08:36:21 es-efs-master2 crmd[1924]:   notice: run_graph:
> Transition 1481608 (Complete=12, Pending=0, Fired=0, Skipped=19,
> Incomplete=13, Source=/var/lib/pacemaker/pengine/pe-input-1361.bz2): Stopped
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op monitor for
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip on
> es-efs-master1: unknown error (1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op monitor for
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:0
> on es-efs-master1: unknown error (1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op monitor for
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver
> on es-efs-master1: unknown error (1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op start for
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd on
> es-efs-master1: unknown error (1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op start for
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web on
> es-efs-master1: unknown error (1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op monitor for
> fence-es-efs-master2-ipmi on es-efs-master1: unknown error (1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Processing failed op demote for
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> on es-efs-master2: unknown error (1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning: unpack_rsc_op:
> Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1
> to stop after a failed demote action
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning:
> common_apply_stickiness: Forcing
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd
> away from es-efs-master1 after 1000000 failures (max=1000000)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:  warning:
> common_apply_stickiness: Forcing
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web
> away from es-efs-master1 after 1000000 failures (max=1000000)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Recover editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started
> es-efs-master1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave
> -> Master es-efs-master1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Stop    editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(es-efs-master2)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Recover editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.slapd.primitive:0#011(Started
> es-efs-master1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Start   editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.ip#011(es-efs-master2)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Promote editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive:1#011(Slave
> -> Master es-efs-master2)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Recover editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver#011(Started
> es-efs-master1 -> es-efs-master2)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Move    editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-server#011(Started
> es-efs-master1 -> es-efs-master2)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Recover editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-dbd#011(Started
> es-efs-master1 -> es-efs-master2)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Recover editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-web#011(Started
> es-efs-master1 -> es-efs-master2)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice: LogActions:
> Recover fence-es-efs-master2-ipmi#011(Started es-efs-master1)
> Nov 18 08:36:21 es-efs-master2 pengine[1923]:   notice:
> process_pe_message: Calculated Transition 1481609:
> /var/lib/pacemaker/pengine/pe-input-1362.bz2
> Nov 18 08:36:21 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0
> (call=469, rc=0, cib-update=0, confirmed=true) ok
> Nov 18 08:36:21 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive_notify_0
> (call=471, rc=0, cib-update=0, confirmed=true) ok
> 
> Not a whole lot happens after that for the next 40 minutes or so,
> until someone manually reboots master1. At this point master2 sees it
> disappear again, decides to fence it again, and this time is
> successful:
> 
> Nov 18 09:11:42 es-efs-master2 corosync[1842]:   [TOTEM ] A new
> membership (172.17.0.6:136) was formed.
> Members left: 1
> Nov 18 09:11:42 es-efs-master2 pacemakerd[1847]:   notice:
> crm_update_peer_state: pcmk_quorum_notification: Node
> es-efs-master1[1] - state is now lost (was member)
> Nov 18 09:11:42 es-efs-master2 crmd[1924]:   notice:
> crm_update_peer_state: pcmk_quorum_notification: Node
> es-efs-master1[1] - state is now lost (was member)
> Nov 18 09:11:42 es-efs-master2 mfsmount[946]: can't connect to master
> ("172.17.0.16":"9421")
> Nov 18 09:11:42 es-efs-master2 mfsmount[946]: can't connect to master
> ("172.17.0.16":"9421")
> Nov 18 09:11:43 es-efs-master2 mfsmaster[22800]: connecting to Master
> Nov 18 09:11:43 es-efs-master2 mfsmaster[22800]: connection failed,
> error: ECONNREFUSED (Connection refused)
> Nov 18 09:11:43 es-efs-master2 stonith-ng[1920]:  warning:
> cib_process_diff: Diff 0.2754095.3 -> 0.2754095.4 from local not
> applied to 0.2754095.3: Failed application of an update diff
> Nov 18 09:11:44 es-efs-master2 mfsmaster[22800]: connecting to Master
> Nov 18 09:11:44 es-efs-master2 mfsmaster[22800]: connection failed,
> error: ECONNREFUSED (Connection refused)
> Nov 18 09:11:44 es-efs-master2 pengine[1923]:  warning: pe_fence_node:
> Node es-efs-master1 will be fenced because the node is no longer part
> of the cluster
> Nov 18 09:11:44 es-efs-master2 pengine[1923]:  warning:
> determine_online_status: Node es-efs-master1 is unclean
> 
> Nov 18 09:12:47 es-efs-master2 mfsmaster[22800]: connecting to Master
> Nov 18 09:12:47 es-efs-master2 stonith-ng[1920]:   notice:
> log_operation: Operation 'reboot' [27561] (call 3 from crmd.1924) for
> host 'es-efs-master1' with device 'fence-es-efs-master1-ipmi'
> returned: 0 (OK)
> Nov 18 09:12:47 es-efs-master2 stonith-ng[1920]:   notice:
> remote_op_done: Operation reboot of es-efs-master1 by es-efs-master2
> for crmd.1924 at es-efs-master2.276145eb: OK
> Nov 18 09:12:47 es-efs-master2 stonith-ng[1920]:  warning:
> cib_process_diff: Diff 0.2754095.7 -> 0.2754095.8 from local not
> applied to 0.2754095.7: Failed application of an update diff
> 
> Now we were able to start making progress again, restarting and
> re-promoting resources on master2, and the system got into a running
> state again:
> 
> Nov 18 09:12:47 es-efs-master2 crmd[1924]:   notice:
> process_lrm_event: LRM operation
> editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.editshare-superserver_start_0
> (call=485, rc=0, cib-update=2963781, confirmed=true) ok
> Nov 18 09:12:47 es-efs-master2
> IPaddr2(editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip)[28258]:
> INFO: Adding inet address 172.17.0.16/22 to device bond0
> Nov 18 09:12:47 es-efs-master2 avahi-daemon[1255]: Registering new
> address record for 172.17.0.16 on bond0.IPv4.
> Nov 18 09:12:47 es-efs-master2
> IPaddr2(editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip)[28258]:
> INFO: Bringing device bond0 up
> Nov 18 09:12:47 es-efs-master2
> IPaddr2(editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip)[28258]:
> INFO: /usr/lib/heartbeat/send_arp -i 200 -r 5 -p
> /var/run/resource-agents/send_arp-172.17.0.16 bond0 172.17.0.16 auto
> not_used not_used
> 
> At this point we had some further problems that caused some more
> downtime, but I don't believe they were related to Pacemaker at this
> time.
> 
> What I'm concerned about is the initial failure of crmd on master1
> that led to master2 deciding to fence it, and then master2's failure
> to fence master1 and thus getting stuck and not being able to manage
> resources. It seems to have simply stopped doing anything, with no
> logs indicating why it did so.
> 
> We are running Pacemaker 1.1.10+git20130802-1ubuntu1 and Corosync
> 2.3.3-1ubuntu1 on a custom distro based on Ubuntu 12.04 LTS, with
> these packages backported from Ubuntu 14.04, and a couple of local
> patches applied to those packages to start the daemons via Upstart
> rather than sysvinit scripts due to some bugs in the sysvinit scripts.
> 
> A couple of questions:
> 
> 1. Is the original crash a known problem? Worth submitting a bug
> about? Already fixed in a newer release of Pacemaker, and I should
> just upgrade?

addressed above

> 2. Is there any reason why fencing would just hang like this, with
> nothing logged? Is there anywhere I'm not looking that would have more
> information? I tried crm_simulate on the pe-input and pe-warn files
> mentioned at various points, and it contained pretty much what I would
> expect, such as planning to fence with other actions depending on the
> fencing action. One thing I notice is that there are actions such as
> "editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_pre_notify_stop_0
> es-efs-master1" and
> "editshare.stack.ffab6f74-2fc0-43e8-99f4-ab446db23ac3.mysql.primitive_pre_notify_stop_0
> es-efs-master1" that don't have dependencies on the stonith action;
> I'm wondering if it's possible that any of these actions hanging (such
> as not being able to perform them on master1) could have caused
> the stonith action to not proceed. Or, another possibility, could the
> "Failed application of an update diff" message indicate that the
> stonith daemon wasn't able to save some essential state to the CIB,
> and thus the stonith action got lost somehow?

No, there's no reason in any of that to cause problems with fencing. As
long as master2 can talk to master1's IPMI interface, fencing should
proceed regardless of all that.
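
You can check that reachability from master2 with plain ipmitool, using
whatever address and credentials the fence device is configured with:

    ipmitool -I lanplus -H <master1-ipmi-address> -U <user> -P <password> \
        chassis power status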

> 3. Is there anyone who provides anything like commercial developer
> support for Pacemaker and Corosync? This is a fairly complex issue to
> be asking for free support on the mailing list; if this winds up being
> too complicated for the venue or if I need to provide full logs that
> may contain customer data, I'm wondering if there's anywhere to
> escalate it. Since we roll our own custom distro and build our own
> packages, we can't really have a traditional vendor contract with
> Red Hat or SUSE, so I'm wondering if there's anyone who offers direct
> developer support and consulting, like SerNet does for Samba and
> Netatalk.

Not that I'm aware of, but if so, they're probably on this list. :) You
might ask whether one of the traditional vendors would do so on a
consulting basis. Also, the developers at clusterlabs.org mailing list
could be helpful in some cases, though the readership is mostly a subset
of this one.

> Let me know if anything else is required to answer this. I've grabbed
> all of the /var/lib/pacemaker/pengine files and syslogs from the time
> of the incident, and can get access back to the system to collect
> other data, so if there's anything that would be helpful to debug I
> can probably grab it.
> 
> If you've gotten all the way down here, thanks for reading!




