[ClusterLabs] Antw: Fencing one node kill others
Klaus Wenninger
kwenning at redhat.com
Wed Jan 4 08:19:03 EST 2017
On 01/04/2017 02:06 PM, Alfonso Ali wrote:
> Hi Ulrich,
>
> You're right, it is as if stonithd selected the incorrect device to
> reboot the node. I'm using fence_ilo as the stonith agent, and
> reviewing the params it takes, it's not clear which one (besides the
> name, which is irrelevant to stonithd) should be used to fence each node.
>
> In cman+rgmanager you can associate fence device params with each
> node, for example:
>
> <clusternode name="e1b07" nodeid="2">
> <fence>
> <method name="single">
> <device name="fence_ilo" ipaddr="e1b07-ilo"/>
> </method>
> </fence>
> </clusternode>
>
> What's the equivalent of that in corosync+pacemaker (using crm)?
>
> In general, in a cluster of more than 2 nodes and more than 2 stonith
> devices, how does stonithd find which stonith device should be used to
> fence a specific node?
You have the attributes pcmk_host_list & pcmk_host_map to control that.
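With fence_ilo each device can really only power-cycle the one host its
iLO belongs to, so it is best to tell stonithd that explicitly. A rough,
untested sketch based on your existing primitives (crm configure syntax):

  primitive fence-e1b13 stonith:fence_ilo \
      params ipaddr=e1b13-ilo login=fence_agent passwd=XXX \
            ssl_insecure=1 pcmk_host_list=e1b13 \
      op monitor interval=300 timeout=120 \
      meta migration-threshold=2 target-role=Started

and likewise pcmk_host_list=e1b03/e1b07/e1b12 on the other three devices.
pcmk_host_map is the variant to use when the device needs a port/plug
name that differs from the node name; with one iLO per host,
pcmk_host_list should be enough. Without either, stonithd has to probe
the devices itself (the 'status' calls in Ulrich's excerpt), and a
status call doesn't know anything about the requested target, so every
device looks capable of fencing e1b13 (hence the 'Found 3 matching
devices' line).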
>
> Regards,
> Ali
>
> On Wed, Jan 4, 2017 at 2:27 AM, Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
> Hi!
>
> A few messages that look uncommon to me are:
>
> crm_reap_dead_member: Removing node with name unknown and id
> 1239211543 from membership cache
>
> A bit later the node name is known:
> info: crm_update_peer_proc: pcmk_cpg_membership: Node
> e1b13[1239211543] - corosync-cpg is now offline
>
> Another node seems to go offline also:
> crmd: info: peer_update_callback: Client e1b13/peer now has
> status [offline] (DC=e1b07, changed=4000000)
>
> This looks OK to me:
> stonith-ng: debug: get_capable_devices: Searching through
> 3 devices to see what is capable of action (reboot) for target e1b13
> stonith-ng: debug: stonith_action_create: Initiating action
> status for agent fence_ilo (target=e1b13)
>
> This looks odd to me:
> stonith-ng: debug: stonith_device_execute: Operation status
> for node e1b13 on fence-e1b03 now running with pid=25689, timeout=20s
> stonith-ng: debug: stonith_device_execute: Operation status
> for node e1b13 on fence-e1b07 now running with pid=25690, timeout=20s
> stonith-ng: debug: stonith_device_execute: Operation status
> for node e1b13 on fence-e1b13 now running with pid=25691, timeout=20s
>
> Maybe not, because it seems the node can be fenced in three different
> ways:
> stonith-ng: debug: stonith_query_capable_device_cb: Found 3
> matching devices for 'e1b13'
>
> Now it's getting odd:
> stonith-ng: debug: schedule_stonith_command: Scheduling reboot
> on fence-e1b07 for remote peer e1b07 with op id
> (ae1956b5-ffe1-4d6a-b5a2-c7bba2c6d7fd) (timeout=60s)
> stonith-ng: debug: stonith_action_create: Initiating action
> reboot for agent fence_ilo (target=e1b13)
> stonith-ng: debug: stonith_device_execute: Operation reboot
> for node e1b13 on fence-e1b07 now running with pid=25784, timeout=60s
> crmd: info: crm_update_peer_expected: handle_request:
> Node e1b07[1239211582] - expected state is now down (was member)
> stonith-ng: debug: st_child_done: Operation 'reboot' on
> 'fence-e1b07' completed with rc=0 (2 remaining)
> stonith-ng: notice: log_operation: Operation 'reboot' [25784]
> (call 6 from crmd.1201) for host 'e1b13' with device 'fence-e1b07'
> returned: 0 (OK)
> attrd: info: crm_update_peer_proc: pcmk_cpg_membership: Node
> e1b07[1239211582] - corosync-cpg is now offline
>
> To me it looks as if your STONITH agents kill the wrong node for
> reasons unknown to me.
>
> (didn't inspect the whole logs)
>
> Regards,
> Ulrich
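If you want to see beforehand which devices stonithd considers for a
given target, something like

  stonith_admin --list e1b13

(assuming pacemaker's stonith_admin is available on a cluster node)
should list every device currently registered as able to fence e1b13;
with the configuration as posted I'd expect more than one to show up
until pcmk_host_list/pcmk_host_map is set.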
>
>
> >>> Alfonso Ali <alfonso.ali at gmail.com> wrote on 03.01.2017 at 18:54
> in message
> <CANeoTMee-=_-Gtf_vxigKsrXNQ0pWEUAg=7YJHrhvrWDNthsmg at mail.gmail.com>:
> > Hi Ulrich,
> >
> > I'm using udpu and static node list. This is my corosync conf:
> >
> > --------------------Corosync
> > configuration-------------------------------------------------
> > totem {
> > version: 2
> > cluster_name: test-cluster
> > token: 3000
> > token_retransmits_before_loss_const: 10
> > clear_node_high_bit: yes
> > crypto_cipher: aes256
> > crypto_hash: sha1
> > transport: udpu
> >
> > interface {
> > ringnumber: 0
> > bindnetaddr: 201.220.222.0
> > mcastport: 5405
> > ttl: 1
> > }
> > }
> >
> > logging {
> > fileline: off
> > to_stderr: no
> > to_logfile: no
> > to_syslog: yes
> > syslog_facility: daemon
> > debug: on
> > timestamp: on
> > logger_subsys {
> > subsys: QUORUM
> > debug: on
> > }
> > }
> >
> > quorum {
> > provider: corosync_votequorum
> > expected_votes: 3
> > }
> >
> > nodelist {
> > node: {
> > ring0_addr: 201.220.222.62
> > }
> > node: {
> > ring0_addr: 201.220.222.23
> > }
> > node: {
> > ring0_addr: 201.220.222.61
> > }
> > node: {
> > ring0_addr: 201.220.222.22
> > }
> > }
> > --------------------------------/Corosync
> > conf-------------------------------------------------------
> >
> > The pacemaker log is very long, so I'm sending it attached as a zip
> > file; I don't know if the list will allow it. If not, please tell me
> > which sections (stonith, crmd, lrmd, attrd, cib) I should post.
> >
> > For a better understanding, the cluster has 4 nodes: e1b03, e1b07,
> > e1b12 and e1b13. I simulated a crash on e1b13 with:
> >
> > echo c > /proc/sysrq-trigger
> >
> > The cluster detected e1b13 as crashed and rebooted it, but after
> > that e1b07 was restarted too, and later e1b03 as well; the only node
> > that remained alive was e1b12. The attached log was taken from that
> > node.
> >
> > Let me know if any other info is needed to debug the problem.
> >
> > Regards,
> > Ali
> >
> >
> >
> > On Mon, Jan 2, 2017 at 3:30 AM, Ulrich Windl
> > <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >
> >> Hi!
> >>
> >> Seeing the detailed log of events would be helpful. Apart from
> >> that, we had a similar issue using multicast (after adding a new
> >> node to an existing cluster). Switching to UDPU helped in our case,
> >> but unless we see the details, it's all just guessing...
> >>
> >> Ulrich
> >> P.S. A good new year to everyone!
> >>
> >> >>> Alfonso Ali <alfonso.ali at gmail.com> wrote on 30.12.2016 at
> >> 21:40 in message
> >> <CANeoTMcuNGw_T9e4WNEEK-nmHnV-NwiX2Ck0UBDnVeuoiC=r8A at mail.gmail.com>:
> >> > Hi,
> >> >
> >> > I have a four-node cluster that uses iLO as the fencing agent.
> >> > When I simulate a node crash (either by killing corosync or with
> >> > echo c > /proc/sysrq-trigger) the node is marked as UNCLEAN and the
> >> > stonith agent is asked to restart it, but every time that happens
> >> > another node in the cluster is also marked as UNCLEAN and rebooted
> >> > as well. After the nodes are rebooted they are marked as online
> >> > again and the cluster resumes operation without problems.
> >> >
> >> > I have reviewed the corosync and pacemaker logs but found nothing
> >> > that explains why the other node is also rebooted.
> >> >
> >> > Any hint on what to check or what to look for would be
> >> > appreciated.
> >> >
> >> > -----------------Cluster conf----------------------------------
> >> > node 1239211542: e1b12 \
> >> > attributes standby=off
> >> > node 1239211543: e1b13
> >> > node 1239211581: e1b03 \
> >> > attributes standby=off
> >> > node 1239211582: e1b07 \
> >> > attributes standby=off
> >> > primitive fence-e1b03 stonith:fence_ilo \
> >> > params ipaddr=e1b03-ilo login=fence_agent passwd=XXX
> ssl_insecure=1 \
> >> > op monitor interval=300 timeout=120 \
> >> > meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b07 stonith:fence_ilo \
> >> > params ipaddr=e1b07-ilo login=fence_agent passwd=XXX
> ssl_insecure=1 \
> >> > op monitor interval=300 timeout=120 \
> >> > meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b12 stonith:fence_ilo \
> >> > params ipaddr=e1b12-ilo login=fence_agent passwd=XXX
> ssl_insecure=1 \
> >> > op monitor interval=300 timeout=120 \
> >> > meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b13 stonith:fence_ilo \
> >> > params ipaddr=e1b13-ilo login=fence_agent passwd=XXX
> ssl_insecure=1 \
> >> > op monitor interval=300 timeout=120 \
> >> > meta migration-threshold=2 target-role=Started
> >> > ..... extra resources ......
> >> > location l-f-e1b03 fence-e1b03 \
> >> > rule -inf: #uname eq e1b03 \
> >> > rule 10000: #uname eq e1b07
> >> > location l-f-e1b07 fence-e1b07 \
> >> > rule -inf: #uname eq e1b07 \
> >> > rule 10000: #uname eq e1b03
> >> > location l-f-e1b12 fence-e1b12 \
> >> > rule -inf: #uname eq e1b12 \
> >> > rule 10000: #uname eq e1b13
> >> > location l-f-e1b13 fence-e1b13 \
> >> > rule -inf: #uname eq e1b13 \
> >> > rule 10000: #uname eq e1b12
> >> > property cib-bootstrap-options: \
> >> > have-watchdog=false \
> >> > dc-version=1.1.15-e174ec8 \
> >> > cluster-infrastructure=corosync \
> >> > stonith-enabled=true \
> >> > cluster-name=test-cluster \
> >> > no-quorum-policy=freeze \
> >> > last-lrm-refresh=1483125286
> >> > ------------------------------------------------------------
> >> ----------------
> >> > ------------
> >> >
> >> > Regards,
> >> > Ali
> >>
>
>
>
>
>
>