[ClusterLabs] Antw: Heads up for ldirectord in SLES12 SP5 "Use of uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord line 1830"

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Aug 9 04:49:29 EDT 2022


Hi!

Digging further in ldirectord, I found that the utility functions do not distinguish between a name that is unknown and a name that is (probably) known but cannot be resolved at the moment.

I hacked the corresponding functions to observe and return the error code (errno) as a negative number.
Short demo:

  DB<2> x ld_gethostbyname('x',AF_INET)
0  '-2' # ENOENT
  DB<3> x ld_gethostbyname('localhost',AF_INET)
0  '127.0.0.1'
  DB<4> x ld_gethostbyname('localhost',AF_INET6)
0  '[::1]'
### hacked /etc/resolv.conf to make the nameservers unreachable (added IPs that are not nameservers or don't exist); the host itself exists
  DB<7> x ld_gethostbyname('mail-1',AF_INET)
0  '-3' # ESRCH

Returning the error message string is a bit trickier, so I just used the error code.
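
On the message string: if a lookup goes through getaddrinfo() from Perl's core Socket module, the error it returns is a dual-valued scalar that compares numerically against the EAI_* constants and prints as a readable message. The following is only a minimal sketch under that assumption. resolve_with_error is a hypothetical helper, not the ldirectord code, and the EAI_* codes are not the errno values shown in the demo above.

```perl
use strict;
use warnings;
use Socket qw(:addrinfo AF_INET SOCK_STREAM);

# Hypothetical helper (not part of ldirectord).
# Returns (address, undef) on success or (undef, $err) on failure;
# $err is the dual-valued (number and message) error from Socket.
sub resolve_with_error {
    my ($name, $af) = @_;
    my ($err, @res) = getaddrinfo($name, "",
        { family => $af, socktype => SOCK_STREAM });
    return (undef, $err) if $err;

    # Convert the first result back into a numeric address string.
    my ($nerr, $host) = getnameinfo($res[0]{addr}, NI_NUMERICHOST, NIx_NOSERV);
    return (undef, $nerr) if $nerr;
    return ($host, undef);
}

my ($addr, $err) = resolve_with_error('localhost', AF_INET);
if (defined $addr) {
    print "resolved: $addr\n";
} elsif ($err == EAI_AGAIN) {
    print "temporary resolver failure: $err\n";   # string form of the error
} else {
    print "lookup failed: $err\n";
}
```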

However, it's not clear what to do when the resolver fails (i.e. the name would be known if the resolver worked). In any case it takes quite a while until an error result is returned.

For example (using the hacked functions):
        if (($fallback->{port} =
             &ld_getservbyname($fallback->{port}, $protocol)) =~ /^-/) {
                &config_error($line, "invalid port for fallback server");
        }

One could check for "== '-2'" instead, but in the other case (resolver failure) there is still no valid port value.
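
To make the two cases explicit, the negative return values could be classified before deciding what to do. This is just a sketch: classify_lookup is a hypothetical helper, and the mapping of -2 to "unknown" and any other negative code to "temporary" is taken from the short demo above, not from a complete list of possible codes.

```perl
use strict;
use warnings;

# Hypothetical helper: interpret the return value of the hacked lookup
# functions, which yield "-<code>" on failure.
sub classify_lookup {
    my ($value) = @_;
    return 'failed'  unless defined $value && length $value;
    return 'ok'      unless $value =~ /^-(\d+)$/;
    return 'unknown' if $1 == 2;    # -2 in the demo: name does not exist
    return 'temporary';             # e.g. -3 in the demo: resolver unreachable
}

for my $v ('127.0.0.1', '[::1]', '-2', '-3') {
    printf "%-10s => %s\n", $v, classify_lookup($v);
}
```

With such a split, 'unknown' could remain a config_error, while 'temporary' would need some other handling that is still to be decided.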

Ideas?

Regards,
Ulrich

>>> Ulrich Windl wrote on 08.08.2022 at 11:19 in message <62F0D518.3F8 : 161 : 60728>:
> Hi!
> 
> The bug is still under investigation, but digging in the ldirectord code I 
> found this part called when stopping:
> 
>                 } elsif ($CMD eq "stop") {
>                         kill 15, $oldpid;
>                         ld_exit(0, "Exiting from ldirectord $CMD");
> 
> ldirectord uses a SIGTERM handler that only sets a flag; the actual 
> termination code runs at some later time.
> Doesn't that mean the cluster will see a misleading exit code (success while 
> parts of ldirectord are still running)?
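
One way to avoid reporting success too early would be to wait until the old process has actually gone before exiting. A minimal sketch, assuming a polling loop with a timeout is acceptable; stop_and_wait and its timeout parameter are hypothetical and not part of ldirectord:

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);

# Hypothetical helper: send SIGTERM, then poll until the process is gone.
# Returns 1 if the process no longer exists, 0 if it is still there
# after $timeout seconds or could not be signalled.
sub stop_and_wait {
    my ($pid, $timeout) = @_;
    unless (kill 'TERM', $pid) {
        return $!{ESRCH} ? 1 : 0;   # ESRCH: no such process, already stopped
    }
    my $deadline = time + $timeout;
    while (time <= $deadline) {
        # Signal 0 sends nothing; it only checks whether the PID exists.
        my $alive = kill(0, $pid) || $!{EPERM};
        return 1 unless $alive;
        sleep 0.1;
    }
    return 0;
}
```

The "stop" branch could then choose its exit status from the return value instead of always exiting with 0.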
> 
> Regards,
> Ulrich
> 
> 
> 
> >>> Ulrich Windl wrote on 03.08.2022 at 11:13 in message <62EA3C2C.E8D : 161 : 60728>:
> > Hi!
> > 
> > I wanted to inform you of an unpleasant bug in ldirectord of SLES12 SP5:
> > We had a short network problem while some redundancy paths were reconfigured 
> > in the infrastructure, with the effect that some network services could not 
> > be reached.
> > Unfortunately ldirectord controlled by the cluster reported a failure (the 
> > director, not the services being directed to):
> > 
> > h11 crmd[28930]:   notice: h11-prm_lvs_mail_monitor_300000:369 [ Use of uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord line 1830, <CFGFILE> line 21. Error [33159] reading file /etc/ldirectord/mail.conf at line 10: invalid address for virtual service\n ]
> > h11 ldirectord[33266]: Exiting with exit_status 2: config_error: Configuration Error
> > 
> > You can guess what happened:
> > Pacemaker tried to recover (stop, then start), but the stop failed, too:
> > h11 lrmd[28927]:   notice: prm_lvs_mail_stop_0:35047:stderr [ Use of uninitialized value $ip_port in pattern match (m//) at /usr/sbin/ldirectord line 1830, <CFGFILE> line 21. ]
> > h11 lrmd[28927]:   notice: prm_lvs_mail_stop_0:35047:stderr [ Error [36293] reading file /etc/ldirectord/mail.conf at line 10: invalid address for virtual service ]
> > h11 crmd[28930]:   notice: Result of stop operation for prm_lvs_mail on h11: 1 (unknown error)
> > 
> > A stop failure meant that the node was fenced, interrupting all the other 
> > services.
> > 
> > Examining the logs I also found this interesting type of error:
> > h11 attrd[28928]:   notice: Cannot update fail-count-prm_lvs_rksapds5#monitor_300000[monitor]=(null) because peer UUID not known (will retry if learned)
> > 
> > Finally, here's the code that caused the error:
> > 
> > sub _ld_read_config_virtual_resolve
> > {
> >         my($line, $vsrv, $ip_port, $af)=(@_);
> > 
> >         if($ip_port){
> >                 $ip_port=&ld_gethostservbyname($ip_port, $vsrv->{protocol}, $af);
> >                 if ($ip_port =~ /(\[[0-9A-Fa-f:]+\]):(\d+)/) {
> >                         $vsrv->{server} = $1;
> >                         $vsrv->{port} = $2;
> >                 } elsif($ip_port){
> >                         ($vsrv->{server}, $vsrv->{port}) = split /:/, $ip_port;
> >                 }
> >                 else {
> >                         &config_error($line,
> >                                 "invalid address for virtual service");
> >                 }
> > ...
> > 
> > The value returned by ld_gethostservbyname is undefined. I also wonder what 
> > the program logic is:
> > If the host looks like a hex address in square brackets, host and port are 
> > split at the colon; otherwise host and port are split at the colon.
> > Why not simply split at the last colon if the value is defined, AND THEN 
> > check if the components look OK?
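
The "split at the last colon, then validate" idea could look roughly like this. It is only a sketch: split_host_port is a hypothetical helper, and the validation rules (bracketed hex address or dotted IPv4, port between 0 and 65535) are my assumptions about what "look OK" should mean.

```perl
use strict;
use warnings;

# Hypothetical helper: split "host:port" at the LAST colon, then validate.
# Returns (host, port) on success and an empty list otherwise.
sub split_host_port {
    my ($ip_port) = @_;
    return unless defined $ip_port;

    # Greedy (.+) makes the final colon the separator.
    my ($host, $port) = $ip_port =~ /^(.+):([^:]+)$/ or return;

    return unless $port =~ /^\d+$/ && $port <= 65535;
    return unless $host =~ /^\[[0-9A-Fa-f:]+\]$/          # bracketed IPv6
               || $host =~ /^\d{1,3}(?:\.\d{1,3}){3}$/;   # dotted IPv4
    return ($host, $port);
}

for my $in ('[::1]:80', '192.168.1.10:25', undef) {
    my @parts = split_host_port($in);
    print defined $in ? $in : 'undef', ' => ',
          @parts ? "server=$parts[0] port=$parts[1]" : 'invalid', "\n";
}
```

An undefined value then ends up in the same "invalid" branch without triggering the "uninitialized value" warning.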
> > 
> > So the address is reported as an "invalid address for virtual service" only 
> > while the resolver service (e.g. via LDAP) is unavailable.
> > I used host and service names for readability.
> > 
> > (I reported the issue to SLES support)
> > 
> > Regards,
> > Ulrich