[ClusterLabs] Antw: Fencing one node kill others

Alfonso Ali alfonso.ali at gmail.com
Wed Jan 4 08:06:34 EST 2017


Hi Ulrich,

You're right, it is as if stonithd selected the incorrect device to reboot
the node. I'm using fence_ilo as the stonith agent, and reviewing the
params it takes, it is not clear which one (besides the device name, which
is irrelevant to stonithd) indicates which node each device should fence.

In cman+rgmanager you can associate fence device parameters with each node,
for example:

<clusternode name="e1b07" nodeid="2">
     <fence>
       <method name="single">
         <device name="fence_ilo" ipaddr="e1b07-ilo"/>
       </method>
     </fence>
  </clusternode>

What is the equivalent of that in corosync+pacemaker (using crm)?
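
From what I have read so far (please correct me if this is wrong), the
mapping seems to live on the stonith resource itself rather than on the
node, via the generic pcmk_host_list/pcmk_host_map parameters. A minimal,
untested sketch in crm syntax, reusing my device names from the config
quoted below:

# untested sketch: restrict this device to fencing e1b07 only
primitive fence-e1b07 stonith:fence_ilo \
     params ipaddr=e1b07-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
          pcmk_host_list=e1b07 pcmk_host_check=static-list \
     op monitor interval=300 timeout=120

i.e. pcmk_host_list would tell stonithd that this device can only be used
to fence e1b07, so it should never be selected for any other target.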

In general, in a cluster with more than 2 nodes and more than 2 stonith
devices, how does stonithd find out which stonith device should be used to
fence a specific node?
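
My working assumption (again untested) is that stonithd either uses a
static pcmk_host_list/pcmk_host_map when one is set, or asks each agent
which hosts it can fence (depending on pcmk_host_check). Assuming the
--list option of stonith_admin does what its name suggests, something like
this should show which devices stonithd currently considers capable of
fencing a given node:

stonith_admin --list e1b13

Is that the right way to look at it?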

Regards,
  Ali

On Wed, Jan 4, 2017 at 2:27 AM, Ulrich Windl <
Ulrich.Windl at rz.uni-regensburg.de> wrote:

> Hi!
>
> A few messages that look uncommon to me are:
>
> crm_reap_dead_member:   Removing node with name unknown and id 1239211543
> from membership cache
>
> A bit later the node name is known:
> info: crm_update_peer_proc:     pcmk_cpg_membership: Node
> e1b13[1239211543] - corosync-cpg is now offline
>
> Another node seems to go offline also:
> crmd:     info: peer_update_callback:   Client e1b13/peer now has status
> [offline] (DC=e1b07, changed=4000000)
>
> This looks OK to me:
> stonith-ng:    debug: get_capable_devices:      Searching through 3
> devices to see what is capable of action (reboot) for target e1b13
> stonith-ng:    debug: stonith_action_create:    Initiating action status
> for agent fence_ilo (target=e1b13)
>
> This looks odd to me:
> stonith-ng:    debug: stonith_device_execute:   Operation status for node
> e1b13 on fence-e1b03 now running with pid=25689, timeout=20s
> stonith-ng:    debug: stonith_device_execute:   Operation status for node
> e1b13 on fence-e1b07 now running with pid=25690, timeout=20s
> stonith-ng:    debug: stonith_device_execute:   Operation status for node
> e1b13 on fence-e1b13 now running with pid=25691, timeout=20s
>
> Maybe not, because it seems you can fence the node in three different
> ways:
> stonith-ng:    debug: stonith_query_capable_device_cb:  Found 3 matching
> devices for 'e1b13'
>
> Now it's getting odd:
> stonith-ng:    debug: schedule_stonith_command: Scheduling reboot on
> fence-e1b07 for remote peer e1b07 with op id (ae1956b5-ffe1-4d6a-b5a2-c7bba2c6d7fd)
> (timeout=60s)
> stonith-ng:    debug: stonith_action_create:    Initiating action reboot
> for agent fence_ilo (target=e1b13)
>  stonith-ng:    debug: stonith_device_execute:  Operation reboot for node
> e1b13 on fence-e1b07 now running with pid=25784, timeout=60s
>  crmd:     info: crm_update_peer_expected:      handle_request: Node
> e1b07[1239211582] - expected state is now down (was member)
> stonith-ng:    debug: st_child_done:    Operation 'reboot' on
> 'fence-e1b07' completed with rc=0 (2 remaining)
> stonith-ng:   notice: log_operation:    Operation 'reboot' [25784] (call 6
> from crmd.1201) for host 'e1b13' with device 'fence-e1b07' returned: 0 (OK)
> attrd:     info: crm_update_peer_proc:  pcmk_cpg_membership: Node
> e1b07[1239211582] - corosync-cpg is now offline
>
> To me it looks as if your STONITH agents kill the wrong node for reasons
> unknown to me.
>
> (I didn't inspect the whole log)
>
> Regards,
> Ulrich
>
>
> >>> Alfonso Ali <alfonso.ali at gmail.com> wrote on 03.01.2017 at 18:54 in
> message
> <CANeoTMee-=_-Gtf_vxigKsrXNQ0pWEUAg=7YJHrhvrWDNthsmg at mail.gmail.com>:
> > Hi Ulrich,
> >
> > I'm using udpu and a static node list. This is my corosync config:
> >
> > -------------------- Corosync configuration --------------------
> > totem {
> >     version: 2
> >     cluster_name: test-cluster
> >     token: 3000
> >     token_retransmits_before_loss_const: 10
> >     clear_node_high_bit: yes
> >     crypto_cipher: aes256
> >     crypto_hash: sha1
> >     transport: udpu
> >
> >     interface {
> >         ringnumber: 0
> >         bindnetaddr: 201.220.222.0
> >         mcastport: 5405
> >         ttl: 1
> >     }
> > }
> >
> > logging {
> >     fileline: off
> >     to_stderr: no
> >     to_logfile: no
> >     to_syslog: yes
> >     syslog_facility: daemon
> >     debug: on
> >     timestamp: on
> >     logger_subsys {
> >         subsys: QUORUM
> >         debug: on
> >     }
> > }
> >
> > quorum {
> >     provider: corosync_votequorum
> >     expected_votes: 3
> > }
> >
> > nodelist {
> >     node {
> >         ring0_addr: 201.220.222.62
> >     }
> >     node {
> >         ring0_addr: 201.220.222.23
> >     }
> >     node {
> >         ring0_addr: 201.220.222.61
> >     }
> >     node {
> >         ring0_addr: 201.220.222.22
> >     }
> > }
> > -------------------- /Corosync configuration --------------------
> >
> > The pacemaker log is very long, so I'm sending it attached as a zip file.
> > I don't know if the list will allow it; if not, please tell me which
> > sections (stonith, crmd, lrmd, attrd, cib) I should post.
> >
> > For a better understanding: the cluster has 4 nodes, e1b03, e1b07, e1b12
> > and e1b13. I simulated a crash on e1b13 with:
> >
> > echo c > /proc/sysrq-trigger
> >
> > The cluster detected e1b13 as crashed and rebooted it, but after that
> > e1b07 was restarted too, and later e1b03 as well; the only node that
> > remained alive was e1b12. The attached log was taken from that node.
> >
> > Let me know if any other info is needed to debug the problem.
> >
> > Regards,
> >   Ali
> >
> >
> >
> > On Mon, Jan 2, 2017 at 3:30 AM, Ulrich Windl <
> > Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >
> >> Hi!
> >>
> >> Seeing the detailed log of events would be helpful. That said, we had a
> >> similar issue when using multicast (after adding a new node to an
> >> existing cluster). Switching to UDPU helped in our case, but unless we
> >> see the details, it's all just guessing...
> >>
> >> Ulrich
> >> P.S. A good new year to everyone!
> >>
> >> >>> Alfonso Ali <alfonso.ali at gmail.com> wrote on 30.12.2016 at 21:40 in
> >> message
> >> <CANeoTMcuNGw_T9e4WNEEK-nmHnV-NwiX2Ck0UBDnVeuoiC=r8A at mail.gmail.com>:
> >> > Hi,
> >> >
> >> > I have a four-node cluster that uses iLO as the fencing agent. When I
> >> > simulate a node crash (either killing corosync or echo c >
> >> > /proc/sysrq-trigger) the node is marked as UNCLEAN and requested to be
> >> > restarted by the stonith agent, but every time that happens another
> >> > node in the cluster is also marked as UNCLEAN and rebooted as well.
> >> > After the nodes are rebooted they are marked as online again and the
> >> > cluster resumes operation without problems.
> >> >
> >> > I have reviewed the corosync and pacemaker logs but found nothing that
> >> > explains why the other node is also rebooted.
> >> >
> >> > Any hint of what to check or what to look for would be appreciated.
> >> >
> >> > -----------------Cluster conf----------------------------------
> >> > node 1239211542: e1b12 \
> >> >         attributes standby=off
> >> > node 1239211543: e1b13
> >> > node 1239211581: e1b03 \
> >> >         attributes standby=off
> >> > node 1239211582: e1b07 \
> >> >         attributes standby=off
> >> > primitive fence-e1b03 stonith:fence_ilo \
> >> >         params ipaddr=e1b03-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
> >> >         op monitor interval=300 timeout=120 \
> >> >         meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b07 stonith:fence_ilo \
> >> >         params ipaddr=e1b07-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
> >> >         op monitor interval=300 timeout=120 \
> >> >         meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b12 stonith:fence_ilo \
> >> >         params ipaddr=e1b12-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
> >> >         op monitor interval=300 timeout=120 \
> >> >         meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b13 stonith:fence_ilo \
> >> >         params ipaddr=e1b13-ilo login=fence_agent passwd=XXX ssl_insecure=1 \
> >> >         op monitor interval=300 timeout=120 \
> >> >         meta migration-threshold=2 target-role=Started
> >> > ..... extra resources ......
> >> > location l-f-e1b03 fence-e1b03 \
> >> >         rule -inf: #uname eq e1b03 \
> >> >         rule 10000: #uname eq e1b07
> >> > location l-f-e1b07 fence-e1b07 \
> >> >         rule -inf: #uname eq e1b07 \
> >> >         rule 10000: #uname eq e1b03
> >> > location l-f-e1b12 fence-e1b12 \
> >> >         rule -inf: #uname eq e1b12 \
> >> >         rule 10000: #uname eq e1b13
> >> > location l-f-e1b13 fence-e1b13 \
> >> >         rule -inf: #uname eq e1b13 \
> >> >         rule 10000: #uname eq e1b12
> >> > property cib-bootstrap-options: \
> >> >         have-watchdog=false \
> >> >         dc-version=1.1.15-e174ec8 \
> >> >         cluster-infrastructure=corosync \
> >> >         stonith-enabled=true \
> >> >         cluster-name=test-cluster \
> >> >         no-quorum-policy=freeze \
> >> >         last-lrm-refresh=1483125286
> >> > ----------------------------------------------------------------
> >> >
> >> > Regards,
> >> >   Ali
> >>
>
>
>
>

