[ClusterLabs] Q: "(bnxt_en): transmit queue 2 timed out"

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Oct 7 09:22:12 EDT 2021


Hi!

This is not strictly a cluster question, but the problem occurred while the cluster was live-migrating Xen PVMs.
We have a cluster of three Dell PowerEdge R7415 servers featuring the following hardware (and running SLES15 SP2):

* Two "Broadcom BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller" ("bnxt_en")
* Two "Broadcom NetXtreme BCM5720 2-port Gigabit Ethernet PCIe" ("tg3")
* Two "QLogic QLE2690 Single Port 16Gb FC to PCIe Gen3 x8 Adapter"
* One "Dell PERC H730P Mini" ("MegaRAID SAS-3 3108 [Invader]" / "megaraid_sas") providing a RAID1 for the OS from two SSDs.
* One AMD EPYC 7401P 24-Core Processor (48 logical CPUs)
* VM images are located on OCFS2 provided via the FC SAN

The system had been running stably for some weeks, but during a nightly live migration the following kernel error popped up:
Oct 05 20:13:25 h19 kernel: ------------[ cut here ]------------
Oct 05 20:13:25 h19 kernel: NETDEV WATCHDOG: p4p1 (bnxt_en): transmit queue 2 timed out
Oct 05 20:13:25 h19 kernel: WARNING: CPU: 27 PID: 0 at ../net/sched/sch_generic.c:443 dev_watchdog+0x253/0x260
Oct 05 20:13:25 h19 kernel: Modules linked in: edd st sr_mod cdrom lp parport_pc ppdev parport xfrm_user xfrm_algo xsk_diag sctp_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag joydev drbd>
Oct 05 20:13:25 h19 kernel:  ib_core dcdbas(XX) crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper pcspkr mgag200 drm_vram_helper i2c_algo_bit ttm drm_kms_helper drm syscopyarea sysfil>
Oct 05 20:13:25 h19 kernel: Supported: Yes, External
Oct 05 20:13:25 h19 kernel: CPU: 27 PID: 0 Comm: swapper/27 Tainted: G           OE  X    5.3.18-24.70.1.22605.0.PTF.1186731-default #1 SLE15-SP2
...
Oct 05 20:13:25 h19 kernel: Call Trace:
Oct 05 20:13:25 h19 kernel:  <IRQ>
Oct 05 20:13:25 h19 kernel:  ? pfifo_fast_reset+0x110/0x110
Oct 05 20:13:25 h19 kernel:  call_timer_fn+0x2d/0x130
Oct 05 20:13:25 h19 kernel:  run_timer_softirq+0x40a/0x440
Oct 05 20:13:25 h19 kernel:  ? handle_irq_event_percpu+0x51/0x70
Oct 05 20:13:25 h19 kernel:  ? handle_percpu_irq+0x37/0x50
Oct 05 20:13:25 h19 kernel:  __do_softirq+0xe3/0x2dc
Oct 05 20:13:25 h19 kernel:  irq_exit+0xd5/0xe0
Oct 05 20:13:25 h19 kernel:  xen_evtchn_do_upcall+0x2c/0x40
Oct 05 20:13:25 h19 kernel:  xen_do_hypervisor_callback+0x29/0x40
Oct 05 20:13:25 h19 kernel:  </IRQ>
Oct 05 20:13:25 h19 kernel: RIP: e030:xen_hypercall_sched_op+0xa/0x20
Oct 05 20:13:25 h19 kernel:  ? xen_hypercall_sched_op+0xa/0x20
Oct 05 20:13:25 h19 kernel:  ? xen_safe_halt+0xc/0x20
Oct 05 20:13:25 h19 kernel:  ? default_idle+0x1c/0x140
Oct 05 20:13:25 h19 kernel:  ? do_idle+0x1bb/0x270
Oct 05 20:13:25 h19 kernel:  ? cpu_startup_entry+0x19/0x20
Oct 05 20:13:25 h19 kernel:  ? cpu_bringup_and_idle+0x6a/0x90
Oct 05 20:13:25 h19 kernel:  ? asm_cpu_bringup_and_idle+0x5/0x1000
...
Oct 05 20:13:25 h19 kernel: bnxt_en 0000:84:00.0 p4p1: TX timeout detected, starting reset task!
Oct 05 20:13:25 h19 kernel: bnxt_en 0000:84:00.0 p4p1: Resp cmpl intr err msg: 0x51
Oct 05 20:13:25 h19 kernel: bnxt_en 0000:84:00.0 p4p1: hwrm_ring_free type 1 failed. rc:fffffff0 err:0
Oct 05 20:13:25 h19 kernel: bnxt_en 0000:84:00.0 p4p1: Resp cmpl intr err msg: 0x51
Oct 05 20:13:25 h19 kernel: bnxt_en 0000:84:00.0 p4p1: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
Oct 05 20:13:25 h19 kernel: bnxt_en 0000:84:00.0 p4p1: Resp cmpl intr err msg: 0x51
Oct 05 20:13:25 h19 kernel: bnxt_en 0000:84:00.0 p4p1: hwrm_ring_free type 2 failed. rc:fffffff0 err:0
...
Oct 05 20:13:25 h19 kernel: BUG: workqueue lockup - pool cpus=21 node=0 flags=0x0 nice=-20 stuck for 40s!
...
Oct 05 20:13:25 h19 kernel: BUG: workqueue lockup - pool cpus=21 node=0 flags=0x0 nice=-20 stuck for 72s!

### Note: the time stamps of all those messages are identical, probably because the journal daemon had a problem (see below)
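### (To reconstruct the real order despite the identical wall-clock stamps, the kernel's own
### timestamps should still be stored in the journal as _SOURCE_MONOTONIC_TIMESTAMP; a rough,
### untested sketch, assuming a persistent journal, to print them next to each kernel message
### of the previous boot:
###   journalctl -k -b -1 -o json | jq -r '"\(._SOURCE_MONOTONIC_TIMESTAMP) \(.MESSAGE)"'
### )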
...
Oct 05 20:13:25 h19 kernel: BUG: workqueue lockup - pool cpus=21 node=0 flags=0x0 nice=-20 stuck for 104s!
...
Oct 05 20:13:25 h19 kernel: BUG: workqueue lockup - pool cpus=21 node=0 flags=0x0 nice=-20 stuck for 136s!
...
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#831 CDB: Write(10) 2a 00 0b ee 4c 60 00 00 08 00
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#830 CDB: Write(10) 2a 00 0b ee 4c b0 00 00 08 00
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#824 CDB: Write(10) 2a 00 0b e5 d3 68 00 00 40 00
...
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#768 CDB: Write(10) 2a 00 0b ee 49 10 00 00 08 00
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#768 OCR is requested due to IO timeout!!
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#768 SCSI host state: 5  SCSI host busy: 59  FW outstanding: 59
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#768 scmd: (0x00000000148f896f)  retries: 0x0  allowed: 0x5
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#768 CDB: Write(10) 2a 00 0b ee 49 10 00 00 08 00
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#768 Request descriptor details:
Oct 05 20:13:25 h19 kernel: sd 0:2:0:0: [sda] tag#768 RequestFlags:0xe  MSIxIndex:0x2c  SMID:0x301  LMID:0x0  DevHandle:0x0
Oct 05 20:13:25 h19 kernel: IO request frame:
Oct 05 20:13:25 h19 kernel: 00000000: f1000000 00000000 00000000 581d2000 00600002 00000020 00000000 00001000
...
Oct 05 20:13:25 h19 kernel: megaraid_sas 0000:c1:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
Oct 05 20:13:25 h19 kernel: megaraid_sas 0000:c1:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
Oct 05 20:13:25 h19 kernel: BUG: workqueue lockup - pool cpus=21 node=0 flags=0x0 nice=-20 stuck for 168s!
...
Oct 05 20:13:25 h19 systemd-coredump[26759]: Process 3321 (systemd-journal) of user 0 dumped core.
...
Oct 05 20:12:52 h19 systemd[1]: systemd-journald.service: Watchdog timeout (limit 3min)!

### So probably journald hung first and was signaled, producing those messages, which were logged once journald had restarted
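### (The journald crash itself could be inspected with coredumpctl, and the 3-minute watchdog
### could be relaxed with a drop-in if it keeps firing; a sketch, using the PID from the
### systemd-coredump line above and an assumed 10-minute limit:
###   coredumpctl info 3321
###   systemctl edit systemd-journald    # add "[Service]" and "WatchdogSec=10min"
### )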
...
Oct 05 20:16:37 h19 pacemaker-controld[7378]:  error: Result of stop operation for prm_xen_test-jeos6 on h19: Timed Out
Oct 05 20:16:37 h19 libvirtd[9267]: End of file while reading data: Input/output error
...

Still, the system had a problem that caused a node fence a short time later.

My question is: did anybody else see messages similar to "(bnxt_en): transmit queue 2 timed out"?

My initial guess was that the NIC problem came first, causing a megaraid problem (disk write errors), which in turn caused journald to get stuck.
Maybe the order was different, though.
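If it happens again, the NIC's own statistics might show whether TX timeouts/resets recur independently of the storage path; a rough check (the counter names differ between driver and firmware versions), e.g.:
h19:~ # ethtool -S p4p1 | grep -iE 'err|drop|timeout'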

The only similar case I know of is my home PC: it has a NIC for which Linux _always_ reports transmit queue timeouts (no networking possible) when Windows was running before and I rebooted into Linux via dual-boot.
The only fix I know is a power cycle; the hardware reset button does not fix the problem.

Regards,
Ulrich

P.S. In the meantime I've updated the BIOS, firmware, and OS, but the release notes for the BIOS/firmware updates did not mention any of these problems.
At the time of the problem the driver was:
h19:~ # ethtool -i p4p1
driver: bnxt_en
version: 1.10.1
firmware-version: 218.0.169.2/pkg 21.80.16.95
...
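
In case it is useful for comparison: on a kernel and iproute2 new enough that bnxt_en registers devlink health reporters, the firmware's own health/reset state can be queried as well; a sketch (not verified on this SLES kernel):
h19:~ # devlink health show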




