[Pacemaker] Problem with configuring stonith rcd_serial

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Nov 2 05:59:37 EDT 2010


On Fri, Oct 29, 2010 at 04:52:43PM +0200, Eberhard Kuemmerle wrote:
> On 29 Oct 2010 14:43, Dejan Muhamedagic wrote:
> >> stonith -t rcd_serial -p "test /dev/ttyS0 rts 2000" test
> >> ** (process:21181): DEBUG: rcd_serial_set_config:called
> >> Alarm clock
> >> ==> RESET WORKS!
> >>
> >> stonith -t rcd_serial hostlist="node1 node2" ttydev="/dev/ttyS0"
> >> dtr\|rts="rts" msduration="2000" -S
> >> ** (process:28054): DEBUG: rcd_serial_set_config:called
> >> stonith: rcd_serial device OK.
> >>
> >> stonith -t rcd_serial hostlist="node1 node2" ttydev="/dev/ttyS0"
> >> dtr\|rts="rts" msduration="2000" -l
> >> ** (process:27543): DEBUG: rcd_serial_set_config:called
> >> node1 node2
> >>
> >> stonith -t rcd_serial hostlist='node1 node2' ttydev="/dev/ttyS0"
> >> dtr\|rts="rts" msduration="2000" -T reset node2
> >> ** (process:29624): DEBUG: rcd_serial_set_config:called
> >> ** (process:29624): CRITICAL **: rcd_serial_reset_req: host 'node2' not
> >> in hostlist.
> >>
> > And this message never appears in the logs?
> >
> Not in /var/log/messages

Great. That needs fixing too.

> >> ==> RESET FAILED
> >>
> >> stonith -t rcd_serial hostlist='node1, node2' ttydev="/dev/ttyS0"
> >> dtr\|rts="rts" msduration="2000" -T reset node2
> >> ** (process:26929): DEBUG: rcd_serial_set_config:called
> >> ** (process:26929): CRITICAL **: rcd_serial_reset_req: host 'node2' not
> >> in hostlist.
> >> ==> RESET FAILED (note: the hostlist is comma-separated here)
> >>
> >> stonith -t rcd_serial hostlist="node1 node2" ttydev="/dev/ttyS0"
> >> dtr\|rts="rts" msduration="2000" -T reset "node1 node2"
> >> ==> RESET WORKS, BUT the argument <<reset "node1 node2">> is nonsense...
> >> ==> There seems to be a problem with parsing the host list!
> >>
> > It turns out that the hostlist can contain just one node. That
> > makes sense, since you can reach only one host over the serial
> > cable. The plugin also makes no effort to check whether the
> > hostlist looks meaningful, i.e. it simply treats "node1 node2" as a
> > single node name (as you've shown above).
> >
> > So, you'll need to configure two stonith resources, one per node.
> >
> Very good idea! That brought me a bit forward:
> 
> Now, I used the patched rcd_serial.so with dtr_rts instead of dtr|rts
> and the following config:
> 
> primitive stonith1 stonith:rcd_serial \
>         params hostlist="node2" ttydev="/dev/ttyS0" dtr_rts="rts" msduration="2000" \
>         op monitor interval="60s"
> primitive stonith2 stonith:rcd_serial \
>         params hostlist="node1" ttydev="/dev/ttyS0" dtr_rts="rts" msduration="2000" \
>         op monitor interval="60s"
> location stonith1-loc stonith1 \
>         rule $id="stonith1-loc-id" -inf: #uname eq node2
> location stonith2-loc stonith2 \
>         rule $id="stonith2-loc-id" -inf: #uname eq node1
> 
> Then, I said 'kill -9 <corosync_pid> ' on node2, and stonith on node1
> really initiated a REBOOT of node2!
> 
> BUT in /var/log/messages of node1, stonith-ng thinks that the operation
> failed:
> 
> Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
> not parse (0 2): ** (process:12139): DEBUG: rcd_serial_set_config:called
> Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
> not parse (3 19): (process:12139): DEBUG: rcd_serial_set_config:called
> Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
> not parse (0 0):
> Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
> not parse (0 2): ** (process:12141): DEBUG: rcd_serial_set_config:called
> Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
> not parse (3 19): (process:12141): DEBUG: rcd_serial_set_config:called
> Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
> not parse (0 0):
> Oct 29 16:06:55 node1 pengine: [31454]: WARN: process_pe_message:
> Transition 29: WARNINGs found during PE processing. PEngine Input stored
> in: /var/lib/pengine/pe-warn-10.bz2
> Oct 29 16:06:55 node1 stonith: rcd_serial device not accessible.

Can't recall seeing this in the logs you posted earlier. This
seems to be a genuine error, perhaps due to some particular
circumstances.

> Oct 29 16:06:55 node1 stonith-ng: [31449]: notice: log_operation:
> Operation 'monitor' [12143] for device 'stonith2' returned: 1
> Oct 29 16:06:55 node1 crmd: [31455]: WARN: status_from_rc: Action 118
> (stonith2_monitor_60000) on node1 failed (target: 0 vs. rc: 1): Error
> Oct 29 16:06:55 node1 crmd: [31455]: WARN: update_failcount: Updating
> failcount for stonith2 on node1 after failed monitor: rc=1
> (update=value++, time=1288361215)
> Oct 29 16:06:57 node1 kernel: [23312.814010] r8169 0000:02:00.0: eth0:
> link down
> Oct 29 16:06:57 node1 stonith-ng: [31449]: ERROR: log_operation:
> Operation 'reboot' [12142] for host 'node2' with device 'stonith1'
> returned: 1 (call 0 from (null))

When you ran -T reset on the command line, did you check the exit
code of the command? Did it exit with 0 or with 1? To me it looked
like it exited with 0, but please verify that (echo $?). Please also
check that the "stonith ... -S" invocation exits with 0.
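
For example, something along these lines (just a sketch, reusing the
parameters from your configuration), checking the exit status right
after each command:

  stonith -t rcd_serial hostlist="node2" ttydev="/dev/ttyS0" \
      dtr_rts="rts" msduration="2000" -T reset node2
  echo $?   # should print 0 if the reset succeeded

  stonith -t rcd_serial hostlist="node2" ttydev="/dev/ttyS0" \
      dtr_rts="rts" msduration="2000" -S
  echo $?   # should print 0 if the device status is OK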

If both exit with 0 and report the right thing, then please run the
test with the cluster again. First make sure that the monitor
operation on the stonith resources succeeds, then try the fencing
operation. If either of them fails again, please open a bug report
and attach an hb_report.
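
To check the monitor operation and to collect the data, something
like this should do (only a sketch, adjust it to your setup):

  crm_mon -1                      # both stonith resources should be started, with no failed actions
  crm resource cleanup stonith1   # clear any leftover failcounts before retesting
  crm resource cleanup stonith2
  # then provoke fencing again, e.g. by killing corosync on node2 as before
  hb_report -f "2010-10-29 16:00" /tmp/rcd_serial-report   # if it still fails, collect data for the bug report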

Thanks,

Dejan

> The state remained unclean:
> # crm_mon
> Node node2: UNCLEAN (offline)
> Online: [ node1 ]
> 
> That caused multiple reboots of node2, until I deactivated stonith. (The
> message "Operation 'reboot' ... returned: 1" repeated each time.)
> 
> After that, the state became clean.
> 
> So, we have taken a big step forward, but we are not at the finish yet...
> 
> Thank you,
>   Eberhard



