[ClusterLabs] Antw: Fencing one node kill others
Klaus Wenninger
kwenning at redhat.com
Wed Jan 4 08:19:03 EST 2017
On 01/04/2017 02:06 PM, Alfonso Ali wrote:
> Hi Ulrich,
>
> You're right, it is as if stonithd selected the incorrect device to
> reboot the node. I'm using fence_ilo as the stonith agent, and
> reviewing the params it takes, it's not clear which one (besides the
> name, which is irrelevant to stonithd) should be used to fence each node.
>
> In cman+rgmanager you can associate fence device params with each
> node, for example:
>
> <clusternode name="e1b07" nodeid="2">
> <fence>
> <method name="single">
> <device name="fence_ilo" ipaddr="e1b07-ilo"/>
> </method>
> </fence>
> </clusternode>
>
> What's the equivalent of that in corosync+pacemaker (using crm)?
>
> In general, in a cluster of more than 2 nodes and more than 2 stonith
> devices, how does stonithd find which stonith device should be used to
> fence a specific node?
You have the attributes pcmk_host_list & pcmk_host_map to control that.
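With fence_ilo each device can really only power-cycle the one host its
iLO belongs to, so it is best to tell stonithd that explicitly. A rough,
untested sketch based on your existing primitives (crm configure syntax):

  primitive fence-e1b13 stonith:fence_ilo \
      params ipaddr=e1b13-ilo login=fence_agent passwd=XXX \
            ssl_insecure=1 pcmk_host_list=e1b13 \
      op monitor interval=300 timeout=120 \
      meta migration-threshold=2 target-role=Started

and likewise pcmk_host_list=e1b03/e1b07/e1b12 on the other three devices.
pcmk_host_map is the variant to use when the device needs a port/plug
name that differs from the node name; with one iLO per host,
pcmk_host_list should be enough. Without either, stonithd has to probe
the devices itself (the 'status' calls in Ulrich's excerpt), and a
status call doesn't know anything about the requested target, so every
device looks capable of fencing e1b13 (hence the 'Found 3 matching
devices' line).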
>
> Regards,
> Ali
>
> On Wed, Jan 4, 2017 at 2:27 AM, Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>
> Hi!
>
> A few messages that look uncommon to me are:
>
> crm_reap_dead_member: Removing node with name unknown and id
> 1239211543 from membership cache
>
> A bit later the node name is known:
> info: crm_update_peer_proc: pcmk_cpg_membership: Node
> e1b13[1239211543] - corosync-cpg is now offline
>
> Another node seems to go offline also:
> crmd: info: peer_update_callback: Client e1b13/peer now has
> status [offline] (DC=e1b07, changed=4000000)
>
> This looks OK to me:
> stonith-ng: debug: get_capable_devices: Searching through
> 3 devices to see what is capable of action (reboot) for target e1b13
> stonith-ng: debug: stonith_action_create: Initiating action
> status for agent fence_ilo (target=e1b13)
>
> This looks odd to me:
> stonith-ng: debug: stonith_device_execute: Operation status
> for node e1b13 on fence-e1b03 now running with pid=25689, timeout=20s
> stonith-ng: debug: stonith_device_execute: Operation status
> for node e1b13 on fence-e1b07 now running with pid=25690, timeout=20s
> stonith-ng: debug: stonith_device_execute: Operation status
> for node e1b13 on fence-e1b13 now running with pid=25691, timeout=20s
>
> Maybe not, because it seems the node can be fenced in three different
> ways:
> stonith-ng: debug: stonith_query_capable_device_cb: Found 3
> matching devices for 'e1b13'
>
> Now it's getting odd:
> stonith-ng: debug: schedule_stonith_command: Scheduling reboot
> on fence-e1b07 for remote peer e1b07 with op id
> (ae1956b5-ffe1-4d6a-b5a2-c7bba2c6d7fd) (timeout=60s)
> stonith-ng: debug: stonith_action_create: Initiating action
> reboot for agent fence_ilo (target=e1b13)
> stonith-ng: debug: stonith_device_execute: Operation reboot
> for node e1b13 on fence-e1b07 now running with pid=25784, timeout=60s
> crmd: info: crm_update_peer_expected: handle_request:
> Node e1b07[1239211582] - expected state is now down (was member)
> stonith-ng: debug: st_child_done: Operation 'reboot' on
> 'fence-e1b07' completed with rc=0 (2 remaining)
> stonith-ng: notice: log_operation: Operation 'reboot' [25784]
> (call 6 from crmd.1201) for host 'e1b13' with device 'fence-e1b07'
> returned: 0 (OK)
> attrd: info: crm_update_peer_proc: pcmk_cpg_membership: Node
> e1b07[1239211582] - corosync-cpg is now offline
>
> To me it looks as if your STONITH agents kill the wrong node for
> reasons unknown to me.
>
> (didn't inspect the whole logs)
>
> Regards,
> Ulrich
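If you want to see beforehand which devices stonithd considers for a
given target, something like

  stonith_admin --list e1b13

(assuming pacemaker's stonith_admin is available on a cluster node)
should list every device currently registered as able to fence e1b13;
with the configuration as posted I'd expect more than one to show up
until pcmk_host_list/pcmk_host_map is set.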
>
>
> >>> Alfonso Ali <alfonso.ali at gmail.com> wrote on 03.01.2017 at 18:54
> in message
> <CANeoTMee-=_-Gtf_vxigKsrXNQ0pWEUAg=7YJHrhvrWDNthsmg at mail.gmail.com>:
> > Hi Ulrich,
> >
> > I'm using udpu and static node list. This is my corosync conf:
> >
> > --------------------Corosync
> > configuration-------------------------------------------------
> > totem {
> > version: 2
> > cluster_name: test-cluster
> > token: 3000
> > token_retransmits_before_loss_const: 10
> > clear_node_high_bit: yes
> > crypto_cipher: aes256
> > crypto_hash: sha1
> > transport: udpu
> >
> > interface {
> > ringnumber: 0
> > bindnetaddr: 201.220.222.0
> > mcastport: 5405
> > ttl: 1
> > }
> > }
> >
> > logging {
> > fileline: off
> > to_stderr: no
> > to_logfile: no
> > to_syslog: yes
> > syslog_facility: daemon
> > debug: on
> > timestamp: on
> > logger_subsys {
> > subsys: QUORUM
> > debug: on
> > }
> > }
> >
> > quorum {
> > provider: corosync_votequorum
> > expected_votes: 3
> > }
> >
> > nodelist {
> > node: {
> > ring0_addr: 201.220.222.62
> > }
> > node: {
> > ring0_addr: 201.220.222.23
> > }
> > node: {
> > ring0_addr: 201.220.222.61
> > }
> > node: {
> > ring0_addr: 201.220.222.22
> > }
> > }
> > --------------------------------/Corosync
> > conf-------------------------------------------------------
> >
> > The pacemaker log is very long, so I'm sending it attached as a zip
> > file; I don't know if the list will allow it. If not, please tell me
> > which sections (stonith, crmd, lrmd, attrd, cib) I should post.
> >
> > For a better understanding, the cluster has 4 nodes: e1b03, e1b07,
> > e1b12 and e1b13. I simulated a crash on e1b13 with:
> >
> > echo c > /proc/sysrq-trigger
> >
> > The cluster detected e1b13 as crashed and rebooted it, but after
> > that e1b07 was restarted too, and later e1b03 as well; the only node
> > that remained alive was e1b12. The attached log was taken from that
> > node.
> >
> > Let me know if any other info is needed to debug the problem.
> >
> > Regards,
> > Ali
> >
> >
> >
> > On Mon, Jan 2, 2017 at 3:30 AM, Ulrich Windl
> > <Ulrich.Windl at rz.uni-regensburg.de> wrote:
> >
> >> Hi!
> >>
> >> Seeing the detailed log of events would be helpful. Apart from
> >> that, we had a similar issue using multicast (after adding a new
> >> node to an existing cluster). Switching to UDPU helped in our case,
> >> but unless we see the details, it's all just guessing...
> >>
> >> Ulrich
> >> P.S. A good new year to everyone!
> >>
> >> >>> Alfonso Ali <alfonso.ali at gmail.com> wrote on 30.12.2016 at
> >> 21:40 in message
> >> <CANeoTMcuNGw_T9e4WNEEK-nmHnV-NwiX2Ck0UBDnVeuoiC=r8A at mail.gmail.com>:
> >> > Hi,
> >> >
> >> > I have a four-node cluster that uses iLO as the fencing agent.
> >> > When I simulate a node crash (either by killing corosync or with
> >> > echo c > /proc/sysrq-trigger) the node is marked as UNCLEAN and the
> >> > stonith agent is asked to restart it, but every time that happens
> >> > another node in the cluster is also marked as UNCLEAN and rebooted
> >> > as well. After the nodes are rebooted they are marked as online
> >> > again and the cluster resumes operation without problems.
> >> >
> >> > I have reviewed the corosync and pacemaker logs but found nothing
> >> > that explains why the other node is also rebooted.
> >> >
> >> > Any hint on what to check or what to look for would be
> >> > appreciated.
> >> >
> >> > -----------------Cluster conf----------------------------------
> >> > node 1239211542: e1b12 \
> >> > attributes standby=off
> >> > node 1239211543: e1b13
> >> > node 1239211581: e1b03 \
> >> > attributes standby=off
> >> > node 1239211582: e1b07 \
> >> > attributes standby=off
> >> > primitive fence-e1b03 stonith:fence_ilo \
> >> > params ipaddr=e1b03-ilo login=fence_agent passwd=XXX
> ssl_insecure=1 \
> >> > op monitor interval=300 timeout=120 \
> >> > meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b07 stonith:fence_ilo \
> >> > params ipaddr=e1b07-ilo login=fence_agent passwd=XXX
> ssl_insecure=1 \
> >> > op monitor interval=300 timeout=120 \
> >> > meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b12 stonith:fence_ilo \
> >> > params ipaddr=e1b12-ilo login=fence_agent passwd=XXX
> ssl_insecure=1 \
> >> > op monitor interval=300 timeout=120 \
> >> > meta migration-threshold=2 target-role=Started
> >> > primitive fence-e1b13 stonith:fence_ilo \
> >> > params ipaddr=e1b13-ilo login=fence_agent passwd=XXX
> ssl_insecure=1 \
> >> > op monitor interval=300 timeout=120 \
> >> > meta migration-threshold=2 target-role=Started
> >> > ..... extra resources ......
> >> > location l-f-e1b03 fence-e1b03 \
> >> > rule -inf: #uname eq e1b03 \
> >> > rule 10000: #uname eq e1b07
> >> > location l-f-e1b07 fence-e1b07 \
> >> > rule -inf: #uname eq e1b07 \
> >> > rule 10000: #uname eq e1b03
> >> > location l-f-e1b12 fence-e1b12 \
> >> > rule -inf: #uname eq e1b12 \
> >> > rule 10000: #uname eq e1b13
> >> > location l-f-e1b13 fence-e1b13 \
> >> > rule -inf: #uname eq e1b13 \
> >> > rule 10000: #uname eq e1b12
> >> > property cib-bootstrap-options: \
> >> > have-watchdog=false \
> >> > dc-version=1.1.15-e174ec8 \
> >> > cluster-infrastructure=corosync \
> >> > stonith-enabled=true \
> >> > cluster-name=test-cluster \
> >> > no-quorum-policy=freeze \
> >> > last-lrm-refresh=1483125286
> >> > ------------------------------------------------------------
> >> ----------------
> >> > ------------
> >> >
> >> > Regards,
> >> > Ali
> >>
>
>
>
>
>
>