[ClusterLabs] Problem with stonith and starting services

Ken Gaillot kgaillot at redhat.com
Thu Jul 6 10:48:33 EDT 2017


On 07/06/2017 09:26 AM, Klaus Wenninger wrote:
> On 07/06/2017 04:20 PM, Cesar Hernandez wrote:
>>> If node2 is getting the notification of its own fencing, it wasn't
>>> successfully fenced. Successful fencing would render it incapacitated
>>> (powered down, or at least cut off from the network and any shared
>>> resources).
>>
>> Maybe I don't understand you, or maybe you don't understand me... ;)
>> This is the syslog of the machine, where you can see that it has rebooted successfully, and as I said, it has rebooted successfully every time:
> 
> It is not just a question of whether it was rebooted at all.
> Your fence agent must not return success until that has definitely
> happened and the node is down.
> Otherwise you will see that message, and the node will have to
> somehow cope with the fact that the rest of the cluster already
> thinks it is down.

But the "allegedly fenced" message comes in after the node has rebooted,
so it would seem that everything was in the proper sequence.

It looks like a bug where the fenced node rejoins quickly enough that it
is a member again before its fencing confirmation has been sent. I know
there have been plenty of clusters with quickly rebooting nodes and slow
fencing devices, so it seems unlikely such a bug would have gone
unnoticed, but I don't see another explanation.
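
For illustration, a fence agent that honors the contract Klaus describes
looks roughly like the sketch below: issue the power-off, then poll the
power device and report success only once the target is verifiably off,
otherwise time out and report failure. The power-ctl commands are
hypothetical placeholders for whatever power/cloud API you use, not any
real fence agent:

#!/usr/bin/env python3
# Minimal sketch of the fence-agent contract discussed above: never report
# success for a reboot/off action until the target is confirmed down.
import subprocess
import sys
import time

POLL_INTERVAL = 2      # seconds between power-status checks
CONFIRM_TIMEOUT = 60   # give up and report failure after this long

def power_off(node):
    # Hypothetical: ask the power device / hypervisor to cut power to 'node'.
    subprocess.run(["power-ctl", "off", node], check=True)

def power_status(node):
    # Hypothetical: prints "on" or "off" for 'node' on stdout.
    out = subprocess.run(["power-ctl", "status", node],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def fence(node):
    power_off(node)
    deadline = time.time() + CONFIRM_TIMEOUT
    while time.time() < deadline:
        if power_status(node) == "off":
            return 0   # success: the node is verifiably down
        time.sleep(POLL_INTERVAL)
    return 1           # never confirmed off: report failure, don't guess

if __name__ == "__main__":
    sys.exit(fence(sys.argv[1]))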

>> Jul  5 10:41:54 node2 kernel: [    0.000000] Initializing cgroup subsys cpuset
>> Jul  5 10:41:54 node2 kernel: [    0.000000] Initializing cgroup subsys cpu
>> Jul  5 10:41:54 node2 kernel: [    0.000000] Initializing cgroup subsys cpuacct
>> Jul  5 10:41:54 node2 kernel: [    0.000000] Linux version 3.16.0-4-amd64 (debian-kernel at lists.debian.org) (gcc version 4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.39-1 (2016-12-30)
>> Jul  5 10:41:54 node2 kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-4-amd64 root=UUID=711e1ec2-2a36-4405-bf46-44b43cfee42e ro init=/bin/systemd console=ttyS0 console=hvc0
>> Jul  5 10:41:54 node2 kernel: [    0.000000] e820: BIOS-provided physical RAM map:
>> Jul  5 10:41:54 node2 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
>> Jul  5 10:41:54 node2 kernel: [    0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
>> Jul  5 10:41:54 node2 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
>> Jul  5 10:41:54 node2 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003fffffff] usable
>> Jul  5 10:41:54 node2 kernel: [    0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
>> Jul  5 10:41:54 node2 kernel: [    0.000000] NX (Execute Disable) protection: active
>> Jul  5 10:41:54 node2 kernel: [    0.000000] SMBIOS 2.4 present.
>>
>> ...
>>
>> Jul  5 10:41:54 node2 dhclient: DHCPREQUEST on eth0 to 255.255.255.255 port 67
>>
>> ...
>>
>> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync Cluster Engine ('UNKNOWN'): started and ready to provide service.
>> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Corosync built-in features: nss
>> Jul  5 10:41:54 node2 corosync[585]:   [MAIN  ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
>>
>> ...
>>
>> Jul  5 10:41:57 node2 crmd[608]:   notice: Defaulting to uname -n for the local classic openais (with plugin) node name
>> Jul  5 10:41:57 node2 crmd[608]:   notice: Membership 4308: quorum acquired
>> Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node node2[1108352940] - state is now member (was (null))
>> Jul  5 10:41:57 node2 crmd[608]:   notice: plugin_handle_membership: Node node11[794540] - state is now member (was (null))
>> Jul  5 10:41:57 node2 crmd[608]:   notice: The local CRM is operational
>> Jul  5 10:41:57 node2 crmd[608]:   notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Watching for stonith topology changes
>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: Membership 4308: quorum acquired
>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: plugin_handle_membership: Node node11[794540] - state is now member (was (null))
>> Jul  5 10:41:57 node2 stonith-ng[604]:   notice: On loss of CCM Quorum: Ignore
>> Jul  5 10:41:58 node2 stonith-ng[604]:   notice: Added 'st-fence_propio:0' to the device list (1 active devices)
>> Jul  5 10:41:59 node2 stonith-ng[604]:   notice: Operation reboot of node2 by node11 for crmd.2141 at node11.61c3e613: OK
>> Jul  5 10:41:59 node2 crmd[608]:     crit: We were allegedly just fenced by node11 for node11!
>> Jul  5 10:41:59 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client crmd (conn=0x228d970, async-conn=0x228d970) left
>> Jul  5 10:41:59 node2 pacemakerd[597]:  warning: The crmd process (608) can no longer be respawned, shutting the cluster down.
>> Jul  5 10:41:59 node2 pacemakerd[597]:   notice: Shutting down Pacemaker
>> Jul  5 10:41:59 node2 pacemakerd[597]:   notice: Stopping pengine: Sent -15 to process 607
>> Jul  5 10:41:59 node2 pengine[607]:   notice: Invoking handler for signal 15: Terminated
>> Jul  5 10:41:59 node2 pacemakerd[597]:   notice: Stopping attrd: Sent -15 to process 606
>> Jul  5 10:41:59 node2 attrd[606]:   notice: Invoking handler for signal 15: Terminated
>> Jul  5 10:41:59 node2 attrd[606]:   notice: Exiting...
>> Jul  5 10:41:59 node2 corosync[585]:   [pcmk  ] info: pcmk_ipc_exit: Client attrd (conn=0x2280ef0, async-conn=0x2280ef0) left



