[Pacemaker] 1.1.10 problems on CentOS 6.5

Thu Dec 12 13:39:37 UTC 2013

I was successfully running Pacemaker 1.1.8 on a pair of CentOS 6.4
servers, but after updating to CentOS 6.5 and Pacemaker 1.1.10,
pacemaker misbehaves.

The first symptoms appeared with the 1.1.10-14.el6 packages. About 20
hours after the upgrade, the first drbd monitor timeouts showed up.

Dec 09 18:50:12 Updated: pacemaker-libs-1.1.10-14.el6.x86_64
Dec 09 18:50:13 Updated: pacemaker-cli-1.1.10-14.el6.x86_64
Dec 09 18:50:13 Updated: pacemaker-cluster-libs-1.1.10-14.el6.x86_64
Dec 09 18:50:13 Updated: pacemaker-1.1.10-14.el6.x86_64

Dec 10 15:27:55 ysmha01 lrmd[3076]:  warning: child_timeout_callback: drbd_export_monitor_29000 process (PID 19608) timed out
Dec 10 15:27:55 ysmha01 lrmd[3076]:  warning: operation_finished: drbd_export_monitor_29000:19608 - timed out after 20000ms
Dec 10 15:27:55 ysmha01 crmd[3079]:    error: process_lrm_event: LRM operation drbd_export_monitor_29000 (77) Timed Out (timeout=20000ms)
Dec 10 15:27:56 ysmha01 crmd[3079]:   notice: process_lrm_event: LRM operation drbd_export_notify_0 (call=99, rc=0, cib-update=0, confirmed=true) ok

At this point, I tried putting the node in standby and bringing it back
online, and cleaning up the resources, to no avail. I also tried
stopping pacemaker, without luck. I rebooted both servers, and on Dec 11
the failures returned: first the pingd monitor timed out, then the drbd
monitor.

Dec 11 16:12:10 ysmha01 lrmd[3060]:  warning: child_timeout_callback: pingd_monitor_20000 process (PID 26237) timed out
Dec 11 16:12:10 ysmha01 lrmd[3060]:  warning: operation_finished: pingd_monitor_20000:26237 - timed out after 15000ms
Dec 11 16:12:10 ysmha01 crmd[3063]:    error: process_lrm_event: LRM operation pingd_monitor_20000 (35) Timed Out (timeout=15000ms)

Dec 11 16:12:19 ysmha01 lrmd[3060]:  warning: child_timeout_callback: drbd_export_monitor_29000 process (PID 26268) timed out
Dec 11 16:12:19 ysmha01 lrmd[3060]:  warning: operation_finished: drbd_export_monitor_29000:26268 - timed out after 20000ms
Dec 11 16:12:19 ysmha01 crmd[3063]:    error: process_lrm_event: LRM operation drbd_export_monitor_29000 (62) Timed Out (timeout=20000ms)

I upgraded to the latest RPMs (1.1.10-14.el6_5.1) yesterday afternoon.
Shortly before 1 am, the problems came back.

Dec 12 00:49:39 ysmha01 pengine[3149]:   notice: process_pe_message: Calculated Transition 41: /var/lib/pacemaker/pengine/pe-input-173.bz2
Dec 12 00:50:03 ysmha01 lrmd[3147]:  warning: child_timeout_callback: drbd_export_monitor_29000 process (PID 18496) timed out
Dec 12 00:50:03 ysmha01 lrmd[3147]:  warning: operation_finished: drbd_export_monitor_29000:18496 - timed out after 20000ms
Dec 12 00:50:03 ysmha01 crmd[3150]:    error: process_lrm_event: LRM operation drbd_export_monitor_29000 (60) Timed Out (timeout=20000ms)
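
To see which monitor operations time out and how often, the lrmd
"operation_finished" lines can be tallied with a small script. The
following is only a sketch: the function name count_timeouts is made up
for illustration, and the regular expression assumes the line format
shown in the excerpts above, with each log entry on a single line as it
is in the log file itself (not wrapped by a mail client).

```python
import re
from collections import Counter

# Assumed format, taken from the excerpts in this post:
#   "... lrmd[3147]:  warning: operation_finished: <op>:<pid> - timed out after <ms>ms"
TIMEOUT_RE = re.compile(
    r"lrmd\[\d+\]:\s+warning: operation_finished:\s+"
    r"(?P<op>\S+):(?P<pid>\d+) - timed out after (?P<ms>\d+)ms"
)


def count_timeouts(lines):
    """Return a Counter mapping operation name -> number of timeouts seen."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("op")] += 1
    return counts
```

It can be fed an open log file, for example
count_timeouts(open("/var/log/messages")), where the path is an
assumption about where syslog writes on these machines.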

For now I am running the machines manually, without pacemaker. What
suggestions do you have for me? What should I try first?

- Revert to 1.1.8?
- Could it be something related to drbd in the new kernel? Should I
downgrade the kernel RPM?

I can post logs on request. Where would be a good place to do that?

Thanks,

Diego