[Pacemaker] The effects of /var being full on failure detection

Brett Delle Grazie brett.dellegrazie at gmail.com
Sat Feb 5 02:49:33 EST 2011


Hi,

On 4 February 2011 23:09, Ryan Thomson <ryan at pet.ubc.ca> wrote:
> Hello list,
>
> I've got a question surrounding the behaviour of pacemaker (with heartbeat) when the partition hosting /var becomes full. Hopefully I can explain the situation clearly.
>
> We are running a two-node cluster with pacemaker 1.0.9 with heartbeat 3.0.3 on CentOS 5 x86_64. STONITH is configured with IPMI. We run in an active/passive configuration.
>
> On Wednesday night our active node (resonance) experienced a severe kernel soft lockup issue. The soft lockup caused the services running on this node to become inaccessible to the clients. Some of the TCP ports accepted telnet connections and the node still responded to pings, but none of the clients were able to access the actual services, including SSH. The first soft lockup occurred around 4:30PM.
>
> Earlier that day (in the wee hours of the morning), /var became full on the passive node (mricenter), causing pengine to experience problems writing to /var:
>
> Feb  2 00:15:36 mricenter pengine: [23556]: ERROR: write_xml_file: bzWriteClose() failed: -6
>
> This was not noticed as our monitoring was inadequate.
>
> Once the soft lockup occurred on the active node and /var on the passive node was full, both heartbeat and pacemaker apparently continued operating as if everything was normal with the cluster. The logs on the passive node indicated no loss of heartbeat communication and showed the resources controlled by pacemaker as running, presumably returning success from their "monitor" operations:
>
> Feb  2 22:11:30 mricenter pengine: [23556]: notice: native_print: stonith-mricenter     (stonith:external/ipmi):        Started resonance.fakedomain.com
> Feb  2 22:11:30 mricenter pengine: [23556]: notice: native_print: stonith-resonance     (stonith:external/ipmi):        Started mricenter.fakedomain.com
> Feb  2 22:11:30 mricenter pengine: [23556]: notice: clone_print:  Clone Set: ping-clone
> Feb  2 22:11:30 mricenter pengine: [23556]: notice: short_print:      Started: [ mricenter.fakedomain.com resonance.fakedomain.com ]
> Feb  2 22:11:30 mricenter pengine: [23556]: notice: group_print:  Resource Group: DRBD
> Feb  2 22:11:30 mricenter pengine: [23556]: notice: native_print:      DRBD-Disk        (heartbeat:drbddisk):   Started resonance.fakedomain.com
> Feb  2 22:11:30 mricenter pengine: [23556]: notice: native_print:      DRBD-Filesystem  (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
> Feb  2 22:11:31 mricenter pengine: [23556]: notice: group_print:  Resource Group: LUN-HOME
> Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Home-LVM (ocf::heartbeat:LVM):   Started resonance.fakedomain.com
> Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Home-Filesystem  (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
> Feb  2 22:11:31 mricenter pengine: [23556]: notice: group_print:  Resource Group: LUN-DATA
> Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Data-LVM (ocf::heartbeat:LVM):   Started resonance.fakedomain.com
> Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Workgroup-Filesystem     (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
> Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Mrcntr-Filesystem        (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
> Feb  2 22:11:32 mricenter pengine: [23556]: notice: group_print:  Resource Group: LUN-DATABASE
> Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      Database-LVM     (ocf::heartbeat:LVM):   Started resonance.fakedomain.com
> Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      Database-Filesystem      (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
> Feb  2 22:11:32 mricenter pengine: [23556]: notice: group_print:  Resource Group: LUN-CHH
> Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      Chh-LVM  (ocf::heartbeat:LVM):   Started resonance.fakedomain.com
> Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      Chh-Filesystem   (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
> Feb  2 22:11:32 mricenter pengine: [23556]: notice: group_print:  Resource Group: NFS
> Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      NFSLock  (lsb:nfslock):  Started resonance.fakedomain.com
> Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print:      NFS-Daemon       (lsb:nfs):      Started resonance.fakedomain.com
> Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: Virtual-IP    (ocf::heartbeat:IPaddr2):       Started resonance.fakedomain.com
> Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: Samba-Daemon  (lsb:smb):      Started resonance.fakedomain.com
> Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: SMmonitor-Daemon      (lsb:SMmonitor):        Started resonance.fakedomain.com
> Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: Tina-Backup-Agent     (lsb:tina.tina_ha):     Started resonance.fakedomain.com
> Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: CUPS-Daemon   (lsb:cups):     Started resonance.fakedomain.com
> Feb  2 22:11:34 mricenter pengine: [23556]: notice: native_print: Failover-Email-Alert  (ocf::heartbeat:MailTo):        Started resonance.fakedomain.com
>
> However, only the pacemaker/heartbeat logs on the passive node continued as normal. On the active, soft-locked node, the pacemaker log output stopped abruptly once the soft lockup condition occurred. We did, however, get this repeating message from heartbeat in the logs:
>
> Feb  2 17:45:46 resonance heartbeat: [8129]: ERROR: 36 messages dropped on a non-blocking channel (send queue maximum length 64)
>
> My question is this: Would /var being full on the passive node have played a role in the cluster's inability to failover during the soft lockup condition on the active node? Or perhaps we hit a condition in which our configuration of pacemaker was unable to detect this type of failure? I'm basically trying to figure out if /var being full on the passive node played a role in the lack of failover or if our configuration is inadequate at detecting the type of failure we experienced.

I'd say absolutely yes. /var being full probably stopped cluster
traffic, or at the least prevented changes to the CIB from being
accepted (from memory, CIB changes are written to temp files in
/var/lib/heartbeat/crm/...).
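
As a quick sanity check on a node in that state, something like this
(just a sketch -- the CIB path is the one from memory above, adjust it
for your install) would show whether pacemaker can still write there:

    #!/usr/bin/env python
    # Rough check of free space on /var and the CIB directory.
    # CIB_DIR is assumed from memory -- adjust for your install.
    import os

    CIB_DIR = "/var/lib/heartbeat/crm"

    st = os.statvfs("/var")
    free_mb = st.f_bavail * st.f_frsize / (1024 * 1024)
    print("free space on /var: %d MB" % free_mb)

    # pengine/cib create temp files under CIB_DIR; with no free blocks
    # those writes fail, as in the bzWriteClose() error quoted above.
    print("CIB dir present and writable (perms): %s"
          % (os.path.isdir(CIB_DIR) and os.access(CIB_DIR, os.W_OK)))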

It can certainly stop SSH sessions from being established.

>
> Thoughts?

Just for the list (since I'm sure you've done this or similar already)
I'd suggest you use SNMP monitoring and add an SNMP trap for /var
being 95% full.
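
If full SNMP monitoring can't go in straight away, even a trivial cron
job that checks the threshold is better than nothing. A minimal,
untested sketch (the mount point and threshold are just the values
suggested above; wire the non-zero exit into whatever alerting you
already have):

    #!/usr/bin/env python
    # Cron-able stand-in for an SNMP trap: exit non-zero and complain
    # on stderr once the filesystem crosses the usage threshold.
    import os
    import sys

    MOUNT = "/var"
    THRESHOLD = 95  # percent used

    st = os.statvfs(MOUNT)
    used = st.f_blocks - st.f_bfree
    avail = st.f_bavail
    used_pct = 100.0 * used / (used + avail)  # roughly what df(1) reports

    if used_pct >= THRESHOLD:
        sys.stderr.write("%s is %.0f%% full\n" % (MOUNT, used_pct))
        sys.exit(1)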

A useful addition is to mount /var/log on a different
disk/partition/logical volume from /var; that way, even if your logs
fill up, the system should still continue to function for a while.
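
For reference, that just means giving /var/log its own filesystem,
e.g. an fstab entry along these lines (the volume name vg00/varlog is
made up -- substitute your own device):

    # /etc/fstab -- example only; the device name is hypothetical
    /dev/vg00/varlog   /var/log   ext3   defaults   1 2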

>
> --
> Ryan Thomson, Systems Administrator, UBC-PET
> UBC Hospital, Koerner Pavilion
> Room G358, 2211 Wesbrook Mall
> Vancouver, BC V6T 2B5
>
> Daytime Tel: 604.822.7605
> Evening Tel: 778.319.4505
> Pager: 604.205.4349 / 6042054349 at msg.telus.com
> Email: ryan at pet.ubc.ca

-- 
Best Regards,

Brett Delle Grazie



