[Pacemaker] Problems with jboss on pacemaker

Thu May 5 12:39:09 EDT 2011

Hi

On 05.05.2011 16:35, Dejan Muhamedagic wrote:
> On Thu, May 05, 2011 at 12:26:57PM +0200, Benjamin Knoth wrote:
>> Hi again,
>>
>> I copied the jboss OCF script and modified the variables so that the
>> script uses my variables when I start it. Now, if I start the OCF script,
>> I get the following every time.
>>
>> ./jboss-test start
>> jboss-test[6165]: DEBUG: [jboss] Enter jboss start
>> jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
>> jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
>> jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
>>
>> Something is wrong.
> 
> Typically, the start operation includes a monitor at the end to
> make sure that the resource really started. In this case it
> looks like the monitor repeatedly fails. You should check the
> monitor operation. Take a look at the output of "crm ra info
> jboss" for parameters which have effect on monitoring. BTW, you
> can test your resource without cluster using ocf-tester.

I can't find ocf-tester, or I don't know how to use it.
The JBoss log says that JBoss starts, but with the OCF script it can't
deploy some packages. The most important error is:

18:18:16,654 ERROR [MainDeployer] Could not start deployment:
file:/data/jboss-4.2.2.GA/server/default/tmp/deploy/tmp8457743723406154025escidoc-core.ear-contents/escidoc-core.war
org.jboss.deployment.DeploymentException: URL
file:/data/jboss-4.2.2.GA/server/default/tmp/deploy/tmp8457743723406154025escidoc-core.ear-contents/escidoc-core-exp.war/
deployment failed

--- Incompletely deployed packages ---
org.jboss.deployment.DeploymentInfo@844a3a10 {
url=file:/data/jboss-4.2.2.GA/server/default/deploy/escidoc-core.ear }
  deployer: org.jboss.deployment.EARDeployer@40f940f9
  status: Deployment FAILED reason: URL
file:/data/jboss-4.2.2.GA/server/default/tmp/deploy/tmp8457743723406154025escidoc-core.ear-contents/escidoc-core-exp.war/
deployment failed
  state: FAILED
  watch: file:/data/jboss-4.2.2.GA/server/default/deploy/escidoc-core.ear
  altDD: null
  lastDeployed: 1304612289701
  lastModified: 1304612278000
  mbeans:

After 4 minutes, Pacemaker shuts JBoss down.
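
The 4 minutes match the 240s start timeout in the primitive definition
quoted further down (CRM_meta_timeout=[240000] in the lrmd log). As Dejan
describes, a start action typically polls the monitor until it succeeds, so
a monitor that never succeeds keeps start running until the cluster kills
it. A minimal sketch of that pattern, not the actual code of the jboss
agent; start_and_wait and its arguments are hypothetical names:

```shell
#!/bin/sh
# Sketch of the "start, then poll monitor" pattern. All names are
# illustrative and not taken from the heartbeat jboss agent.
# Usage: start_and_wait START_CMD MONITOR_CMD MAX_TRIES INTERVAL_SECONDS
start_and_wait() {
    start_cmd=$1
    monitor_cmd=$2
    max_tries=$3
    interval=$4

    # Launch the service; give up immediately if the launch itself fails.
    $start_cmd || return 1

    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # The resource counts as started once monitor reports success.
        if $monitor_cmd; then
            return 0
        fi
        echo "retry monitor ($((tries + 1))/$max_tries)" >&2
        tries=$((tries + 1))
        sleep "$interval"
    done

    # Monitor never succeeded: report a generic error (OCF_ERR_GENERIC is 1).
    return 1
}
```

An agent that loops without a limit of its own ends only when the operation
timeout fires, which would be consistent with the "Timed Out" status that
crm_mon shows for the start action.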

If I run the init script normally, it runs fine and all important packages
are deployed.

I checked the difference between the processes started by the init script
and by the OCF script from Pacemaker:

pacemaker

root     20074  0.0  0.0  12840  1792 ?        S    17:56   0:00 /bin/sh
/usr/lib/ocf/resource.d//heartbeat/jboss start
root     20079  0.0  0.0  48336  1368 ?        S    17:56   0:00 su -
jboss -s /bin/bash -c export JAVA_HOME=/usr/lib64/jvm/java;\n?
                  export JBOSS_HOME=/usr/share/jboss;\n?
            /usr/share/jboss/bin/run.sh -c default
-Djboss.bind.address=0.0.0.0

init-script

root     20079  0.0  0.0  48336  1368 ?        S    17:56   0:00 su
jboss -s /bin/bash -c /usr/share/jboss/bin/run.sh -c default
-Djboss.bind.address=0.0.0.0
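
One visible difference between the two listings: the OCF agent runs "su -"
(a login shell, which sets up the jboss user's login environment) and
exports JAVA_HOME and JBOSS_HOME, while the init script uses plain "su".
Whether that explains the failed deployment is only a guess, but the two
environments can be compared. A small sketch; env_diff and the dump file
names are made up for illustration:

```shell
#!/bin/sh
# Capture the environment each start method gives JBoss (run as root), e.g.:
#   su - jboss -s /bin/bash -c env > /tmp/env.ocf
#   su jboss -s /bin/bash -c env > /tmp/env.init
# The file names are arbitrary examples. Note that the agent additionally
# exports JAVA_HOME and JBOSS_HOME before calling run.sh.

# env_diff FILE_A FILE_B
# Prints lines found only in FILE_A unindented and lines found only
# in FILE_B indented by a tab. Lines present in both are omitted.
env_diff() {
    sorted_a=$(mktemp) || return 1
    sorted_b=$(mktemp) || { rm -f "$sorted_a"; return 1; }
    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort "$1" > "$sorted_a"
    LC_ALL=C sort "$2" > "$sorted_b"
    LC_ALL=C comm -3 "$sorted_a" "$sorted_b"
    rm -f "$sorted_a" "$sorted_b"
}
```

Called as "env_diff /tmp/env.ocf /tmp/env.init", it lists the variables
that differ between the two start methods. Values that span several lines
are not handled by this sketch.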

Cheers

Benjamin

> 
> Thanks,
> 
> Dejan
> 
>> Cheers
>> Benjamin
>>
>> On 05.05.2011 12:03, Benjamin Knoth wrote:
>>> Hi,
>>>
>>> On 05.05.2011 11:46, Dejan Muhamedagic wrote:
>>>> On Wed, May 04, 2011 at 03:44:02PM +0200, Benjamin Knoth wrote:
>>>>>
>>>>>
>>>>> On 04.05.2011 13:18, Benjamin Knoth wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 04.05.2011 12:18, Dejan Muhamedagic wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Wed, May 04, 2011 at 10:37:40AM +0200, Benjamin Knoth wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 04.05.2011 09:42, Florian Haas wrote:
>>>>>>>>>> On 05/04/2011 09:31 AM, Benjamin Knoth wrote:
>>>>>>>>>>> Hi Florian,
>>>>>>>>>>> I tested it with OCF, but I couldn't get it to run.
>>>>>>>>>>
>>>>>>>>>> Well that's really helpful information. Logs? Error messages? Anything?
>>>>>>>
>>>>>>> Logs
>>>>>>>
>>>>>>> May  4 09:55:10 vm36 lrmd: [19214]: WARN: p_jboss_ocf:start process (PID
>>>>>>> 27702) timed out (try 1).  Killing with signal SIGTERM (15).
>>>>>>>
>>>>>>>> You need to set/increase the timeout for the start operation to
>>>>>>>> match the maximum expected start time. Take a look at "crm ra
>>>>>>>> info jboss" for minimum values.
>>>>>>>
>>>>>>> May  4 09:55:10 vm36 attrd: [19215]: info: find_hash_entry: Creating
>>>>>>> hash entry for fail-count-p_jboss_ocf
>>>>>>> May  4 09:55:10 vm36 lrmd: [19214]: WARN: operation start[342] on
>>>>>>> ocf::jboss::p_jboss_ocf for client 19217, its parameters:
>>>>>>> CRM_meta_name=[start] crm_feature_set=[3.0.1]
>>>>>>> java_home=[/usr/lib64/jvm/java] CRM_meta_timeout=[240000] jboss_sto
>>>>>>> p_timeout=[30] jboss_home=[/usr/share/jboss] jboss_pstring=[java
>>>>>>> -Dprogram.name=run.sh] : pid [27702] timed out
>>>>>>> May  4 09:55:10 vm36 attrd: [19215]: info: attrd_trigger_update: Sending
>>>>>>> flush op to all hosts for: fail-count-p_jboss_ocf (INFINITY)
>>>>>>> May  4 09:55:10 vm36 crmd: [19217]: WARN: status_from_rc: Action 64
>>>>>>> (p_jboss_ocf_start_0) on vm36 failed (target: 0 vs. rc: -2): Error
>>>>>>> May  4 09:55:10 vm36 lrmd: [19214]: info: rsc:p_jboss_ocf:346: stop
>>>>>>> May  4 09:55:10 vm36 attrd: [19215]: info: attrd_perform_update: Sent
>>>>>>> update 2294: fail-count-p_jboss_ocf=INFINITY
>>>>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
>>>>>>> - p_jboss_lsb_monitor_0 failed with rc=5: Preventing p_jboss_lsb from
>>>>>>> re-starting on vm36
>>>>>>> May  4 09:55:10 vm36 crmd: [19217]: WARN: update_failcount: Updating
>>>>>>> failcount for p_jboss_ocf on vm36 after failed start: rc=-2
>>>>>>> (update=INFINITY, time=1304495710)
>>>>>>> May  4 09:55:10 vm36 attrd: [19215]: info: find_hash_entry: Creating
>>>>>>> hash entry for last-failure-p_jboss_ocf
>>>>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
>>>>>>> p_jboss_cs_monitor_0 found resource p_jboss_cs active on vm36
>>>>>>> May  4 09:55:10 vm36 crmd: [19217]: info: abort_transition_graph:
>>>>>>> match_graph_event:272 - Triggered transition abort (complete=0,
>>>>>>> tag=lrm_rsc_op, id=p_jboss_ocf_start_0,
>>>>>>> magic=2:-2;64:1375:0:fc16910d-2fe9-4daa-834a-348a4c7645ef, cib=0.53
>>>>>>> 5.2) : Event failed
>>>>>>> May  4 09:55:10 vm36 attrd: [19215]: info: attrd_trigger_update: Sending
>>>>>>> flush op to all hosts for: last-failure-p_jboss_ocf (1304495710)
>>>>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
>>>>>>> - p_jboss_init_monitor_0 failed with rc=5: Preventing p_jboss_init from
>>>>>>> re-starting on vm36
>>>>>>> May  4 09:55:10 vm36 crmd: [19217]: info: match_graph_event: Action
>>>>>>> p_jboss_ocf_start_0 (64) confirmed on vm36 (rc=4)
>>>>>>> May  4 09:55:10 vm36 attrd: [19215]: info: attrd_perform_update: Sent
>>>>>>> update 2297: last-failure-p_jboss_ocf=1304495710
>>>>>>> May  4 09:55:10 vm36 pengine: [19216]: WARN: unpack_rsc_op: Processing
>>>>>>> failed op p_jboss_ocf_start_0 on vm36: unknown exec error (-2)
>>>>>>> May  4 09:55:10 vm36 crmd: [19217]: info: te_rsc_command: Initiating
>>>>>>> action 1: stop p_jboss_ocf_stop_0 on vm36 (local)
>>>>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
>>>>>>> p_jboss_ocf_monitor_0 found resource p_jboss_ocf active on vm37
>>>>>>> May  4 09:55:10 vm36 crmd: [19217]: info: do_lrm_rsc_op: Performing
>>>>>>> key=1:1376:0:fc16910d-2fe9-4daa-834a-348a4c7645ef op=p_jboss_ocf_stop_0 )
>>>>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
>>>>>>>        (ocf::heartbeat:jboss): Stopped
>>>>>>> May  4 09:55:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
>>>>>>> has failed INFINITY times on vm36
>>>>>>> May  4 09:55:10 vm36 pengine: [19216]: WARN: common_apply_stickiness:
>>>>>>> Forcing p_jboss_ocf away from vm36 after 1000000 failures (max=1000000)
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: info: unpack_config: Node scores:
>>>>>>> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
>>>>>>> May  4 09:59:10 vm36 crmd: [19217]: WARN: status_from_rc: Action 50
>>>>>>> (p_jboss_ocf_start_0) on vm37 failed (target: 0 vs. rc: -2): Error
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: info: determine_online_status:
>>>>>>> Node vm36 is online
>>>>>>> May  4 09:59:10 vm36 crmd: [19217]: WARN: update_failcount: Updating
>>>>>>> failcount for p_jboss_ocf on vm37 after failed start: rc=-2
>>>>>>> (update=INFINITY, time=1304495950)
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
>>>>>>> - p_jboss_lsb_monitor_0 failed with rc=5: Preventing p_jboss_lsb from
>>>>>>> re-starting on vm36
>>>>>>> May  4 09:59:10 vm36 crmd: [19217]: info: abort_transition_graph:
>>>>>>> match_graph_event:272 - Triggered transition abort (complete=0,
>>>>>>> tag=lrm_rsc_op, id=p_jboss_ocf_start_0,
>>>>>>> magic=2:-2;50:1377:0:fc16910d-2fe9-4daa-834a-348a4c7645ef, cib=0.53
>>>>>>> 5.12) : Event failed
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
>>>>>>> p_jboss_cs_monitor_0 found resource p_jboss_cs active on vm36
>>>>>>> May  4 09:59:10 vm36 crmd: [19217]: info: match_graph_event: Action
>>>>>>> p_jboss_ocf_start_0 (50) confirmed on vm37 (rc=4)
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
>>>>>>>        (ocf::heartbeat:jboss): Stopped
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
>>>>>>> has failed INFINITY times on vm37
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: WARN: common_apply_stickiness:
>>>>>>> Forcing p_jboss_ocf away from vm37 after 1000000 failures (max=1000000)
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
>>>>>>> has failed INFINITY times on vm36
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: info: native_color: Resource
>>>>>>> p_jboss_ocf cannot run anywhere
>>>>>>> May  4 09:59:10 vm36 pengine: [19216]: notice: LogActions: Leave
>>>>>>> resource p_jboss_ocf   (Stopped)
>>>>>>> May  4 09:59:31 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
>>>>>>>        (ocf::heartbeat:jboss): Stopped
>>>>>>> ....
>>>>>>>
>>>>>>> Now I don't know how I can reset the resource p_jboss_ocf to test it again.
>>>>>>>
>>>>>>>> crm resource cleanup p_jboss_ocf
>>>>>>
>>>>>> That's the way I know, but whether I run this command from the shell
>>>>>> or from the crm shell, I get:
>>>>>>
>>>>>> Cleaning up p_jboss_ocf on vm37
>>>>>> Cleaning up p_jboss_ocf on vm36
>>>>>>
>>>>>> But when I check the status with crm_mon -1, I still get this every time:
>>>>>>
>>>>>> Failed actions:
>>>>>> p_jboss_ocf_start_0 (node=vm36, call=-1, rc=1, status=Timed Out):
>>>>>> unknown error
>>>>>>     p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
>>>>>> installed
>>>>>>     p_jboss_ocf_start_0 (node=vm37, call=281, rc=-2, status=Timed Out):
>>>>>> unknown exec error
>>>>>>
>>>>>> p_jboss was deleted in the config yesterday.
>>>>>
>>>>> For demonstration:
>>>>>
>>>>> 15:34:22 ~ # crm_mon -1
>>>>>
>>>>> Failed actions:
>>>>>     p_jboss_ocf_start_0 (node=vm36, call=376, rc=-2, status=Timed Out):
>>>>> unknown exec error
>>>>>     p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
>>>>> installed
>>>>>     p_jboss_ocf_start_0 (node=vm37, call=283, rc=-2, status=Timed Out):
>>>>> unknown exec error
>>>>>
>>>>> 15:35:02 ~ # crm resource cleanup p_jboss_ocf
>>>>> INFO: no curses support: you won't see colors
>>>>> Cleaning up p_jboss_ocf on vm37
>>>>> Cleaning up p_jboss_ocf on vm36
>>>>>
>>>>> 15:39:12 ~ # crm resource cleanup p_jboss
>>>>> INFO: no curses support: you won't see colors
>>>>> Cleaning up p_jboss on vm37
>>>>> Cleaning up p_jboss on vm36
>>>>>
>>>>> 15:39:19 ~ # crm_mon -1
>>>>>
>>>>> Failed actions:
>>>>>     p_jboss_ocf_start_0 (node=vm36, call=376, rc=-2, status=Timed Out):
>>>>> unknown exec error
>>>>>     p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
>>>>> installed
>>>>>     p_jboss_ocf_start_0 (node=vm37, call=283, rc=-2, status=Timed Out):
>>>>> unknown exec error
>>>
>>> Strange: after I edited the config, all other failed actions were
>>> deleted; only these failed actions are still displayed.
>>>
>>> Failed actions:
>>>     p_jboss_ocf_start_0 (node=vm36, call=380, rc=-2, status=Timed Out):
>>> unknown exec error
>>>     p_jboss_ocf_start_0 (node=vm37, call=287, rc=-2, status=Timed Out):
>>> unknown exec error
>>>
>>>>
>>>> Strange, perhaps you ran into a bug here. You can open a bugzilla
>>>> with hb_report.
>>>>
>>>> Anyway, you should fix the timeout issue.
>>>
>>> I know, but what should I do to resolve this issue?
>>>
>>> My config entry for jboss is:
>>>
>>> primitive p_jboss_ocf ocf:heartbeat:jboss \
>>>         params java_home="/usr/lib64/jvm/java" \
>>>                jboss_home="/usr/share/jboss" \
>>>                jboss_pstring="java -Dprogram.name=run.sh" \
>>>                jboss_stop_timeout="30" \
>>>         op start interval="0" timeout="240s" \
>>>         op stop interval="0" timeout="240s" \
>>>         op monitor interval="20s"
>>>
>>> In the worst case JBoss needs at most 120s to start, and that really is
>>> the worst case.
>>>
>>> Cheers,
>>> Benjamin
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Dejan
>>>>
>>>>
>>>>>>>
>>>>>>> And after some tests I have some resources that no longer exist in
>>>>>>> the Failed actions list. How can I delete them?
>>>>>>>
>>>>>>>> The same way.
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>
>>>>>>>> Dejan
>>>>>>>
>>>>>
>>>>> Thx
>>>>>
>>>>> Benjamin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Florian
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>
>>>>>> Project Home: http://www.clusterlabs.org
>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>
>>
>> -- 
>> Benjamin Knoth
>> Max Planck Digital Library (MPDL)
>> Systemadministration
>> Amalienstrasse 33
>> 80799 Munich, Germany
>> http://www.mpdl.mpg.de
>>
>> Mail: knoth at mpdl.mpg.de
>> Phone:  +49 89 38602 202
>> Fax:    +49-89-38602-280

-- 
Benjamin Knoth
Max Planck Digital Library (MPDL)
Systemadministration
Amalienstrasse 33
80799 Munich, Germany
http://www.mpdl.mpg.de

Mail: knoth at mpdl.mpg.de
Phone:  +49 89 38602 202
Fax:    +49-89-38602-280