[Pacemaker] Pacemaker on system with disk failure

Digimer lists at alteeve.ca
Tue Sep 23 09:50:12 EDT 2014


Can you share your pacemaker and drbd configurations please?
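
Something along these lines should capture both (the pcs commands are
the ones shipped with CentOS 7; 'drbdadm dump' prints the DRBD config
as drbdadm parses it):

   # full cluster configuration, human readable
   pcs config
   # raw CIB, in case you prefer attaching it as a file
   pcs cluster cib > cib.xml
   # DRBD configuration as parsed from /etc/drbd.conf
   drbdadm dump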

digimer

On 23/09/14 09:39 AM, Carsten Otto wrote:
> Hello,
>
> I run Corosync + Pacemaker + DRBD in a two-node cluster, where all
> resources are part of a group colocated with DRBD (DRBD + virtual IP +
> filesystem + ...). To test my configuration, I currently have two nodes
> with only a single disk drive each. This drive is the only LVM physical
> volume in an LVM volume group; the Linux system resides on some of its
> logical volumes, and the device exported by DRBD is another logical volume.
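>
> For illustration, the shape of the configuration is roughly the
> following (disk0 is the DRBD resource name that appears in the logs
> below; the other names and the IP address are placeholders, not my
> real values):
>
>   pcs resource create DRBD ocf:linbit:drbd drbd_resource=disk0 op monitor interval=30s
>   pcs resource master DRBD-master DRBD master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>   pcs resource create fs0 ocf:heartbeat:Filesystem device=/dev/drbd0 directory=/srv fstype=ext4
>   pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.0.100 cidr_netmask=24
>   pcs resource group add services fs0 vip
>   pcs constraint colocation add services with DRBD-master INFINITY with-rsc-role=Master
>   pcs constraint order promote DRBD-master then start services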
>
> When I now cut power to the disk drive on the node running the
> resources (where DRBD is Primary), DRBD notices this and reports
> "Diskless". Furthermore, my services stop working (which is
> understandable without a working disk drive).
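>
> (To repeat the test without touching the hardware, I assume the same
> failure can be provoked from software; sdb is just an example device
> name here:)
>
>   # tell the kernel the backing disk is gone
>   echo offline > /sys/block/sdb/device/state
>   # or, less drastic, detach only DRBD's backing device
>   drbdadm detach disk0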
>
> However, in my experiments one of the following problems occurs:
>
> 1) The services are stopped and DRBD is demoted (according to pcs status
>     and pacemaker.log). However, according to /proc/drbd on the surviving
>     node, the diskless node is still running as Primary. As a consequence,
>     I see failing promote attempts on the surviving node (see also the
>     commands sketched after problem 2 below):
>
> drbd(DRBD)[1797]:       2014/09/23_14:35:56 ERROR: disk0: Called drbdadm -c /etc/drbd.conf primary disk0
> drbd(DRBD)[1797]:       2014/09/23_14:35:56 ERROR: disk0: Exit code 11
>
>     The problem here seems to be:
> crmd:     info: match_graph_event:  Action DRBD_demote_0 (12) confirmed on diskless_node (rc=0)
>
>     This demote operation obviously should not have been confirmed, and
>     I also find it hard to believe that the stop operations of the
>     standard resources really worked without access to the resource agent
>     scripts (which are on the failed disk) and the tools they use.
>
> 2) My services do not work anymore, but nothing happens in the cluster.
>     Everything looks as it did before the failure, the only difference
>     being that /proc/drbd shows "Diskless" and some "oos". It seems
>     corosync/pacemaker keeps reporting "all is well" to the DC, while
>     internally (due to the missing disk) nothing works. I guess that
>     running any of the monitor scripts is problematic without access to
>     the actual files, so I'd like to see some sort of failure communicated
>     from the diskless node to the surviving node (or to have the surviving
>     node reach the same conclusion via a timeout); a rough sketch of what
>     I mean follows below.
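>
> To make the last two points concrete (disk0 as above, fs0 standing in
> for one of the grouped resources; the operation values are only
> examples of what I have in mind, not something I have tested):
>
>   # compare pacemaker's view with DRBD's own view on each node
>   pcs status
>   drbdadm role disk0     # local/peer role as DRBD itself reports it
>   drbdadm cstate disk0   # connection state
>
>   # roughly the behaviour I am asking for: a failed or timed-out
>   # monitor on the diskless node triggering recovery/fencing
>   pcs resource op add fs0 monitor interval=20s timeout=40s on-fail=fence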
>
> Is this buggy behaviour? How should a node behave if all of its disks
> stop working?
>
> I can reproduce this. If you need details about the configuration or
> more output from pacemaker.log, please just let me know.
>
> The versions reported by Centos 7:
>   corosync 2.3.3-2.el7
>   pacemaker 1.1.10-32.el7_0
>   drbd 8.4.5-1.el7.elrepo
>
> Thank you,
> Carsten
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?



