[ClusterLabs] Re: [EXT] Cluster breaks after pcs unstandby node

Mon Jan 18 04:18:04 EST 2021

>>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 18.01.2021 at
10:00 in message
<CALhdMBi4Q=3xRU8yubFEkX2XgBZf1WLO5+LQzNqCo=Co2e-jRQ at mail.gmail.com>:
> Hi,
> 
> I have persistent journal, but 'journalctl -b -1' was empty in this
> case, so it might not be optimally configured. And centralized logging
> is on the todo list
> 
> 
> btw. about the fencing, I have set 'HandlePowerKey=ignore' in
> /etc/systemd/logind.conf
> (for this hardware, I can find no BIOS setting for how to react to the
> power key being pressed, so it cannot be set to instant-off)
> 
> Now when a node is fenced it goes down more quickly, and its only
> journal output is:
> Jan 18 09:33:19 kvm03-node03 systemd-logind[4354]: Power key pressed.
> Jan 18 09:33:24 kvm03-node03 systemd-logind[4354]: Power key pressed.
> 
> So it seems the key needs to be pressed twice, 5 seconds apart, and
> judging by the hardware console, the system does not reboot until
> about 09:33:27 (8 seconds in total)

I haven't looked into the IPMI fencing agent, but ipmitool can do:
chassis power on
chassis power off
chassis power cycle
chassis power reset

IMHO for fencing only "power off" and "power reset" (assuming a hardware
reset) make sense.
Also I don't know how it's implemented: My guess is that it directs the power
supply to switch off, and does _not_ simulate an ACPI power button press...

Playing with the tool here (Dell server), I get:
h16:~ # ipmitool chassis power ## only list commands available
chassis power Commands: status, on, off, cycle, reset, diag, soft
h16:~ # ipmitool chassis restart_cause
System restart cause: unknown
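
As a rough illustration of the "off"/"reset" distinction above, here is a small
dry-run sketch that only prints the ipmitool command a hard fencing action would
map to. The helper name fence_cmd, the placeholder BMC host and user, and the
lanplus interface are assumptions for illustration, not taken from the fence
agent. -E makes ipmitool read the password from the IPMI_PASSWORD environment
variable. According to the ipmitool man page, "off" does not initiate a clean OS
shutdown, whereas "soft" requests an ACPI soft shutdown, so the sketch
deliberately refuses "soft".

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the ipmitool command for a fencing action.
# BMC_HOST and BMC_USER are placeholder variables, not names from this thread.
fence_cmd() {
    case "$1" in
        off)    sub="off" ;;    # power down without a clean OS shutdown
        reboot) sub="reset" ;;  # hard reset
        *)      echo "unsupported action: $1" >&2; return 1 ;;
    esac
    echo "ipmitool -I lanplus -H ${BMC_HOST:-bmc.example.invalid} -U ${BMC_USER:-admin} -E chassis power $sub"
}

fence_cmd off
fence_cmd reboot
```

How a given BMC actually carries out "off" may still differ between vendors, so
the output is worth checking against the behaviour seen on the console.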

> 
> When the node is back online, 'journalctl -b -1' only reports the first
> Jan 18 09:33:19 kvm03-node03 systemd-logind[4354]: Power key pressed.
> 
> The second line was never written to persistent journal

What might help is running "journalctl -f" in a terminal. That way you see the
last messages received, even if they were never written to the filesystem (I
think), so once the host is down the last messages are still on screen.
Disk writes frequently miss the last two or three seconds IMHO.
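
If the lost lines matter, one knob that might be worth trying is journald's sync
interval. journald.conf(5) documents SyncIntervalSec= with a default of 5
minutes, and an immediate sync only after messages of priority CRIT, ALERT or
EMERG. The sketch below writes a drop-in that shortens it. The function name,
the file name 10-fast-sync.conf and the value 1s are assumptions for
illustration, and whether this is enough to capture the second "Power key
pressed" line is untested.

```shell
#!/bin/sh
# Hypothetical helper: write a journald drop-in that shortens the sync interval.
# $1 = target directory (default: /etc/systemd/journald.conf.d)
# $2 = sync interval   (default: 1s, an arbitrary illustrative value)
write_sync_dropin() {
    dir="${1:-/etc/systemd/journald.conf.d}"
    interval="${2:-1s}"
    mkdir -p "$dir" || return 1
    printf '[Journal]\nStorage=persistent\nSyncIntervalSec=%s\n' "$interval" \
        > "$dir/10-fast-sync.conf"
}

# Usage (as root), followed by a journald restart to apply it:
#   write_sync_dropin
#   systemctl restart systemd-journald
```

A shorter interval means more frequent disk syncs, so it is probably best kept
only while debugging.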

Regards,
Ulrich

> 
> 
> 
> On Mon, Jan 18, 2021 at 8:49 AM Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de> wrote:
>>
>> >>> Steffen Vinther Sørensen <svinther at gmail.com> wrote on 16.01.2021 at
>> 19:28 in message
>> <CALhdMBho79Kd7XjV2BvD+-J5i+94vKejnJYB5UEjG=w_hG1Scg at mail.gmail.com>:
>> > Hi and thank you for the insights
>>
>> Hi!
>> ...
>>
>> > I just did a test after the latest adjustments with colocations etc.
>> > Trying to standby node02 ends up with node02 being fenced before
>> > migrations complete. Unfortunately the logs from node02 were lost
>>
>> Don't you have a persistent journal on node2? Maybe it's a good idea to make
>> all nodes log to an external syslog server, at least until your problems are
>> fixed. That would also have the benefit that you get a better global insight
>> into the sequence of events...
>>
>> ...
>>
>> Regards,
>> Ulrich
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 