[Pacemaker] ocfs2_controld.pcmk process issue

Andrew Beekhof andrew at beekhof.net
Tue May 15 20:34:39 EDT 2012


Is this on SLES by any chance?
SUSE are about the only ones with knowledge in this area, I'm afraid.

On Tue, May 15, 2012 at 6:01 AM, Matthew O'Connor <matt at ecsorl.com> wrote:
> Hi!
>
> I ran into the issue of ocfs2_controld.pcmk consuming vast CPU again -
> twice, actually.  The most recent occurrence was after a multi-node
> failure.  One node stayed alive, two nodes had to be rebooted.  After
> the reboots, one of the two came back without issue, and was able to
> mount the OCFS2 stores.  The second node exhibited high CPU usage on the
> ocfs2_controld.pcmk process, and could not mount the OCFS2 stores.  The
> logs were being voraciously filled with the following message:
>
>   ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
>
> This message was being output so frequently that syslogd was starting to
> rate-limit it.  I suspect this accounts for the high CPU usage.  After
> restarting the troubled node several times, I found the solution was to
> order the OCFS2/DLM resource group to stop, cluster-wide, and then
> restart it.  Normal behavior followed.  (In a prior post to the list, I
> referenced hard-killing the ocfs2_controld.pcmk process.  This was a
> more graceful shutdown.)
>
> Attached are two strace outputs.  I'm sorry I'm not very familiar with
> strace, so the value of these files may be questionable.  If there is
> anything else I can provide the next time this happens, I'd be happy to
> do so!  The log-f.txt file was generated with the -f option, and the
> log-fc.txt file was generated with -f -c.
>
> Here also is a snippet from the syslog, during the cluster-wide shutdown
> of the OCFS2/DLM group:
>
> May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:14  ocfs2_controld: last message repeated 199 times
> May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk
> May 14 15:22:16 gw05 dlm_controld.pcmk: [3411]: notice: terminate_ais_connection: Disconnecting from AIS
> May 14 15:22:16 gw05 lrmd: [2993]: info: RA output: (p_dlm:2:stop:stderr) dlm_controld.pcmk: no process found
> May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:20  ocfs2_controld: last message repeated 199 times
> May 14 15:22:25 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:26  ocfs2_controld: last message repeated 199 times
> May 14 15:22:31 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:32  ocfs2_controld: last message repeated 199 times
> May 14 15:22:37 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
> May 14 15:22:38  ocfs2_controld: last message repeated 199 times
>
> One other bit of log that was interesting (well, to me) appeared
> when I tried to manually mount the OCFS2 store on the afflicted
> server:
>
>   mount.ocfs2: Unable to access cluster service while trying to join the group
>
> One other note - I discovered I had not specified a monitor for either
> the pacemaker:o2cb or the pacemaker:controld RA.  Could that have
> possibly triggered this issue?
>
> --
>
> Sincerely,
>  Matthew O'Connor
>
> -----------------------------------------------------------------
> Sr. Software Engineer
> PGP/GPG Key: 0x55F981C4
> Fingerprint: E5DC A0F8 5A40 E4DA 2CE6 B5A2 014C 2CBF 55F9 81C4
>
> Engineering and Computer Simulations, Inc.
> 11825 High Tech Ave Suite 250
> Orlando, FL 32817
>
> Tel:   407-823-9991 x315
> Fax:   407-823-8299
> Email: matt at ecsorl.com
> Web:   www.ecsorl.com
> -----------------------------------------------------------------
>
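
To put a number on how fast the checkpoint error was being logged, the
syslog excerpt quoted above can be tallied with a short script.  This is
only a rough sketch and not part of the original thread: the function
name count_checkpoint_errors and the SAMPLE constant are hypothetical,
the wrapped log lines have been rejoined by hand, and the script assumes
the excerpt does not cross midnight.  It counts each explicit error line
once and adds the count from any "last message repeated N times" line
that directly follows it.

```python
import re

# Error and repeat lines copied from the quoted syslog excerpt
# (wrapped lines rejoined; one unrelated line kept for illustration).
SAMPLE = """\
May 14 15:22:13 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:14  ocfs2_controld: last message repeated 199 times
May 14 15:22:15 gw05 o2cb[4134]: INFO: Stopping ocfs2_controld.pcmk
May 14 15:22:19 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:20  ocfs2_controld: last message repeated 199 times
May 14 15:22:25 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:26  ocfs2_controld: last message repeated 199 times
May 14 15:22:31 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:32  ocfs2_controld: last message repeated 199 times
May 14 15:22:37 gw05 ocfs2_controld: Unable to open checkpoint "ocfs2:controld": Object does not exist
May 14 15:22:38  ocfs2_controld: last message repeated 199 times
"""

ERROR_TEXT = 'Unable to open checkpoint "ocfs2:controld"'
REPEAT_RE = re.compile(r"last message repeated (\d+) times")
# Classic syslog timestamp: month name, day, HH:MM:SS (no year).
STAMP_RE = re.compile(r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})")


def count_checkpoint_errors(text):
    """Return (total_messages, elapsed_seconds) for the checkpoint error.

    Assumption: all lines fall on the same day, so seconds since
    midnight are enough to measure the elapsed time.
    """
    total = 0
    seconds_seen = []
    last_was_error = False
    for line in text.splitlines():
        stamp = STAMP_RE.match(line)
        if not stamp:
            continue
        repeat = REPEAT_RE.search(line)
        if ERROR_TEXT in line:
            total += 1
            last_was_error = True
        elif repeat and last_was_error:
            # syslogd collapsed N identical copies of the previous line.
            total += int(repeat.group(1))
        else:
            # Unrelated line: a later "repeated" line would not refer
            # to the checkpoint error any more.
            last_was_error = False
            continue
        hours, minutes, secs = (int(part) for part in stamp.groups())
        seconds_seen.append(hours * 3600 + minutes * 60 + secs)
    elapsed = max(seconds_seen) - min(seconds_seen) if seconds_seen else 0
    return total, elapsed


if __name__ == "__main__":
    total, elapsed = count_checkpoint_errors(SAMPLE)
    rate = total / elapsed if elapsed else float("nan")
    print(f"{total} messages in {elapsed} s (about {rate:.0f} per second)")
```

Running the same function over the full syslog file, rather than this
short excerpt, would show whether the rate was steady for the whole
time the process was spinning.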



