[Pacemaker] Problem with configuring stonith rcd_serial

Eberhard Kuemmerle E.Kuemmerle at fz-juelich.de
Fri Oct 29 10:52:43 EDT 2010


On 29 Oct 2010 14:43, Dejan Muhamedagic wrote:
>> stonith -t rcd_serial -p "test /dev/ttyS0 rts 2000" test
>> ** (process:21181): DEBUG: rcd_serial_set_config:called
>> Alarm clock
>> ==> RESET WORKS!
>>
>> stonith -t rcd_serial hostlist="node1 node2" ttydev="/dev/ttyS0" dtr\|rts="rts" msduration="2000" -S
>> ** (process:28054): DEBUG: rcd_serial_set_config:called
>> stonith: rcd_serial device OK.
>>
>> stonith -t rcd_serial hostlist="node1 node2" ttydev="/dev/ttyS0" dtr\|rts="rts" msduration="2000" -l
>> ** (process:27543): DEBUG: rcd_serial_set_config:called
>> node1 node2
>>
>> stonith -t rcd_serial hostlist='node1 node2' ttydev="/dev/ttyS0" dtr\|rts="rts" msduration="2000" -T reset node2
>> ** (process:29624): DEBUG: rcd_serial_set_config:called
>> ** (process:29624): CRITICAL **: rcd_serial_reset_req: host 'node2' not in hostlist.
>>
> And this message never appears in the logs?
>
Not in /var/log/messages
>> ==> RESET FAILED
>>
>> stonith -t rcd_serial hostlist='node1, node2' ttydev="/dev/ttyS0" dtr\|rts="rts" msduration="2000" -T reset node2
>> ** (process:26929): DEBUG: rcd_serial_set_config:called
>> ** (process:26929): CRITICAL **: rcd_serial_reset_req: host 'node2' not in hostlist.
>> ==> RESET FAILED (notice: hostlist is comma separated here)
>>
>> stonith -t rcd_serial hostlist="node1 node2" ttydev="/dev/ttyS0" dtr\|rts="rts" msduration="2000" -T reset "node1 node2"
>> ==> RESET WORKS, BUT the argument <<reset "node1 node2">> is nonsense...
>> ==> There seems to be a problem with parsing the host list!
>>
> It turns out that the hostlist can contain just one node. That
> makes sense since you can reach only one host over the serial
> cable. The plugin also makes no effort to tell the user whether the
> hostlist looks meaningful, i.e. it treats "node1 node2" as a single
> node name (as you've shown above).
>
> So, you'll need to configure two stonith resources, one per node.
>
Very good idea! That brought me a bit forward:

Now, I used the patched rcd_serial.so with dtr_rts instead of dtr|rts
and the following config:

primitive stonith1 stonith:rcd_serial \
        params hostlist="node2" ttydev="/dev/ttyS0" dtr_rts="rts" msduration="2000" \
        op monitor interval="60s"
primitive stonith2 stonith:rcd_serial \
        params hostlist="node1" ttydev="/dev/ttyS0" dtr_rts="rts" msduration="2000" \
        op monitor interval="60s"
location stonith1-loc stonith1 \
        rule $id="stonith1-loc-id" -inf: #uname eq node2
location stonith2-loc stonith2 \
        rule $id="stonith2-loc-id" -inf: #uname eq node1

Then I ran 'kill -9 <corosync_pid>' on node2, and stonith on node1
really initiated a REBOOT of node2!
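
For the record, the failure injection as a sketch, assuming pidof is
available on node2:

# on node2: kill corosync to simulate a node failure
kill -9 $(pidof corosync)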

BUT in /var/log/messages of node1, stonith-ng thinks that the operation
failed:

Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
not parse (0 2): ** (process:12139): DEBUG: rcd_serial_set_config:called
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
not parse (3 19): (process:12139): DEBUG: rcd_serial_set_config:called
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
not parse (0 0):
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
not parse (0 2): ** (process:12141): DEBUG: rcd_serial_set_config:called
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
not parse (3 19): (process:12141): DEBUG: rcd_serial_set_config:called
Oct 29 16:06:55 node1 stonith-ng: [31449]: WARN: parse_host_line: Could
not parse (0 0):
Oct 29 16:06:55 node1 pengine: [31454]: WARN: process_pe_message:
Transition 29: WARNINGs found during PE processing. PEngine Input stored
in: /var/lib/pengine/pe-warn-10.bz2
Oct 29 16:06:55 node1 stonith: rcd_serial device not accessible.
Oct 29 16:06:55 node1 stonith-ng: [31449]: notice: log_operation:
Operation 'monitor' [12143] for device 'stonith2' returned: 1
Oct 29 16:06:55 node1 crmd: [31455]: WARN: status_from_rc: Action 118
(stonith2_monitor_60000) on node1 failed (target: 0 vs. rc: 1): Error
Oct 29 16:06:55 node1 crmd: [31455]: WARN: update_failcount: Updating
failcount for stonith2 on node1 after failed monitor: rc=1
(update=value++, time=1288361215)
Oct 29 16:06:57 node1 kernel: [23312.814010] r8169 0000:02:00.0: eth0:
link down
Oct 29 16:06:57 node1 stonith-ng: [31449]: ERROR: log_operation:
Operation 'reboot' [12142] for host 'node2' with device 'stonith1'
returned: 1 (call 0 from (null))

Judging by the parse_host_line warnings, stonith-ng apparently tried to
parse the plugin's DEBUG output as host names. The state remained
unclean:
# crm_mon
Node node2: UNCLEAN (offline)
Online: [ node1 ]
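
The failcounts recorded by update_failcount can be inspected with a
one-shot crm_mon; as a sketch, -f prints them per node:

crm_mon -1 -f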

That caused multiple reboots of node2 until I deactivated stonith. (The
message "Operation 'reboot' ... returned: 1" was repeated each time.)

After that, the state became clean.
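
Before re-enabling stonith, the accumulated failures can be cleared so
that the resources start fresh; a sketch with the crm shell, using the
resource names from above:

crm resource cleanup stonith1
crm resource cleanup stonith2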

So, we have taken a big step forward, but we are not at the finish yet...

Thank you,
  Eberhard

