[Pacemaker] The effects of /var being full on failure detection

Ryan Thomson ryan at pet.ubc.ca
Fri Feb 4 18:09:27 EST 2011


Hello list,

I've got a question surrounding the behaviour of pacemaker (with heartbeat) when the partition hosting /var becomes full. Hopefully I can explain the situation clearly.

We are running a two-node cluster with pacemaker 1.0.9 and heartbeat 3.0.3 on CentOS 5 x86_64, in an active/passive configuration. STONITH is configured with IPMI.

On Wednesday night our active node (resonance) experienced a severe kernel soft lockup issue; the first soft lockup occurred around 4:30PM. The soft lockup made the services running on this node inaccessible to clients: some TCP ports still accepted telnet connections and the node kept responding to pings, but none of the clients could reach the actual services, including SSH.

Earlier that day (in the wee hours of the morning), /var became full on the passive node (mricenter), causing pengine to fail when writing files under /var:

Feb  2 00:15:36 mricenter pengine: [23556]: ERROR: write_xml_file: bzWriteClose() failed: -6

This went unnoticed because our monitoring was inadequate.
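
In hindsight, even a trivial cron check would have caught the disk filling up; something roughly along these lines (the threshold and mail recipient are illustrative, and I haven't tested this exact snippet):

#!/bin/sh
# Hypothetical cron job: warn when /var crosses a usage threshold.
THRESHOLD=90
USAGE=$(df -P /var | awk 'NR==2 { sub("%", "", $5); print $5 }')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "/var is ${USAGE}% full on $(hostname)" \
        | mail -s "/var filling up on $(hostname)" root
fi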

Once the soft lockup had occurred on the active node, with /var on the passive node full, both heartbeat and pacemaker apparently continued operating as if everything was normal with the cluster. The logs on the passive node showed no loss of heartbeat communication and indicated that the resources controlled by pacemaker were still running, presumably because their "monitor" operations were still returning success (the monitor operations are declared in the usual way; see the sketch after this log excerpt):

Feb  2 22:11:30 mricenter pengine: [23556]: notice: native_print: stonith-mricenter     (stonith:external/ipmi):        Started resonance.fakedomain.com
Feb  2 22:11:30 mricenter pengine: [23556]: notice: native_print: stonith-resonance     (stonith:external/ipmi):        Started mricenter.fakedomain.com
Feb  2 22:11:30 mricenter pengine: [23556]: notice: clone_print:  Clone Set: ping-clone
Feb  2 22:11:30 mricenter pengine: [23556]: notice: short_print:      Started: [ mricenter.fakedomain.com resonance.fakedomain.com ]
Feb  2 22:11:30 mricenter pengine: [23556]: notice: group_print:  Resource Group: DRBD
Feb  2 22:11:30 mricenter pengine: [23556]: notice: native_print:      DRBD-Disk        (heartbeat:drbddisk):   Started resonance.fakedomain.com
Feb  2 22:11:30 mricenter pengine: [23556]: notice: native_print:      DRBD-Filesystem  (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
Feb  2 22:11:31 mricenter pengine: [23556]: notice: group_print:  Resource Group: LUN-HOME
Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Home-LVM (ocf::heartbeat:LVM):   Started resonance.fakedomain.com
Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Home-Filesystem  (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
Feb  2 22:11:31 mricenter pengine: [23556]: notice: group_print:  Resource Group: LUN-DATA
Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Data-LVM (ocf::heartbeat:LVM):   Started resonance.fakedomain.com
Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Workgroup-Filesystem     (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
Feb  2 22:11:31 mricenter pengine: [23556]: notice: native_print:      Mrcntr-Filesystem        (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
Feb  2 22:11:32 mricenter pengine: [23556]: notice: group_print:  Resource Group: LUN-DATABASE
Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      Database-LVM     (ocf::heartbeat:LVM):   Started resonance.fakedomain.com
Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      Database-Filesystem      (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
Feb  2 22:11:32 mricenter pengine: [23556]: notice: group_print:  Resource Group: LUN-CHH
Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      Chh-LVM  (ocf::heartbeat:LVM):   Started resonance.fakedomain.com
Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      Chh-Filesystem   (ocf::heartbeat:Filesystem):    Started resonance.fakedomain.com
Feb  2 22:11:32 mricenter pengine: [23556]: notice: group_print:  Resource Group: NFS
Feb  2 22:11:32 mricenter pengine: [23556]: notice: native_print:      NFSLock  (lsb:nfslock):  Started resonance.fakedomain.com
Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print:      NFS-Daemon       (lsb:nfs):      Started resonance.fakedomain.com
Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: Virtual-IP    (ocf::heartbeat:IPaddr2):       Started resonance.fakedomain.com
Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: Samba-Daemon  (lsb:smb):      Started resonance.fakedomain.com
Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: SMmonitor-Daemon      (lsb:SMmonitor):        Started resonance.fakedomain.com
Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: Tina-Backup-Agent     (lsb:tina.tina_ha):     Started resonance.fakedomain.com
Feb  2 22:11:33 mricenter pengine: [23556]: notice: native_print: CUPS-Daemon   (lsb:cups):     Started resonance.fakedomain.com
Feb  2 22:11:34 mricenter pengine: [23556]: notice: native_print: Failover-Email-Alert  (ocf::heartbeat:MailTo):        Started resonance.fakedomain.com
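
For reference, the monitor operations on these resources are the standard crm shell "op monitor" definitions. The Virtual-IP primitive, for example, is declared roughly like this (the IP address and timings below are illustrative, not copied from our CIB):

# illustrative only -- not our actual CIB values
crm configure primitive Virtual-IP ocf:heartbeat:IPaddr2 \
    params ip="192.168.1.10" cidr_netmask="24" \
    op monitor interval="30s" timeout="20s"

As far as I understand it, as long as the agent's monitor action keeps exiting with success on the node running the resource, pacemaker considers the resource healthy, regardless of whether clients can actually reach the service.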

However, only the pacemaker/heartbeat logs on the passive node continued as normal. On the active, soft-locked node, the pacemaker log output stopped abruptly once the soft lockup condition occurred, although heartbeat on that node did keep logging this repeating message:

Feb  2 17:45:46 resonance heartbeat: [8129]: ERROR: 36 messages dropped on a non-blocking channel (send queue maximum length 64)

My question is this: would /var being full on the passive node have played a role in the cluster's inability to fail over during the soft lockup on the active node? Or did we hit a failure mode that our pacemaker configuration is simply unable to detect? In short, I'm trying to figure out whether the full /var contributed to the lack of failover, or whether our configuration is inadequate at detecting the type of failure we experienced.
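
For what it's worth, any extra check we add will probably have to exercise the application layer, since a bare TCP connect still succeeded during the lockup. A rough, untested sketch of the kind of probe I have in mind, run from a host outside the cluster (the "probe" account and its passwordless key are hypothetical):

#!/bin/sh
# Hypothetical probe from outside the cluster: attempt a real SSH login
# rather than a bare TCP connect (which still worked during the lockup).
HOST=resonance.fakedomain.com
if ! ssh -o BatchMode=yes -o ConnectTimeout=10 probe@"$HOST" /bin/true; then
    echo "SSH login to $HOST failed" \
        | mail -s "service probe failed for $HOST" root
fi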

Thoughts?

-- 
Ryan Thomson, Systems Administrator, UBC-PET
UBC Hospital, Koerner Pavilion
Room G358, 2211 Wesbrook Mall
Vancouver, BC V6T 2B5

Daytime Tel: 604.822.7605
Evening Tel: 778.319.4505
Pager: 604.205.4349 / 6042054349 at msg.telus.com
Email: ryan at pet.ubc.ca 



