Bug 423 - nsd process hangs after reload (after SERV FAIL from master ?)
nsd process hangs after reload (after SERV FAIL from master ?)
Status: RESOLVED FIXED
Product: NSD
Classification: Unclassified
Component: NSD Code
3.2.x
Other Linux
: P5 blocker
Assigned To: NSD team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-12-13 02:13 CET by Vicky Shrestha
Modified: 2012-01-18 16:35 CET (History)
4 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Vicky Shrestha 2011-12-13 02:13:16 CET
Hi,

I am seeing random hang on NSD process. Looking at logs suggests it happens after it receives random/temporary SERV FAIL from masters. After a nsd reload we see 4 total process, with 1 hung child process. Strace shows no activity on this process. Zone updates will stop working and it only serves from nsd.db before the hang.


nsd -version
NSD version 3.2.9
Written by NLnet Labs.

Copyright (C) 2001-2011 NLnet Labs.  This is free software.
There is NO warranty; not even for MERCHANTABILITY or FITNESS

Using configuration options:
%configure \
--disable-full-prehash \
--enable-root-server \
--enable-nsec3 \
--with-user=nsd \
--with-ssl \
--with-pidfile=/var/nsd/cache/nsd.pid \
--with-dbfile=/var/nsd/nsd.db \
--with-difffile=/var/nsd/cache/ixfr.db \
--with-xfrdfile=/var/nsd/cache/xfrd.state \
--with-configdir=/var/nsd \
--with-zonesdir=/var/nsd \
--with-zonesdir=/var/nsd

$uname -a
Linux afilias.nuq.pch.net 2.6.32-131.17.1.el6.x86_64 #1 SMP Thu Oct 6 19:24:09 BST 2011 x86_64 x86_64 x86_64 GNU/Linux

CentOS 6.0 x86_64 Virtual Machine running under ESXi 4.1


[1323735545] nsd[15318]: error: xfrd: zone ac.in received error code SERV FAIL from 206.220.230.34
[1323735545] nsd[15318]: error: xfrd: zone co.in received error code SERV FAIL from 206.220.230.34
[1323735552] nsd[15458]: warning: signal received, reloading...
[1323735557] nsd[15472]: info: memory recyclebin holds 132024 bytes
[1323735557] nsd[15318]: info: Zone co.in serial 0 is updated to 2008493670.
[1323735557] nsd[15318]: info: Zone net.in serial 0 is updated to 2008117440.
[1323735557] nsd[15318]: info: Zone gen.in serial 0 is updated to 2008071787.
[1323735557] nsd[15318]: info: Zone gov.in serial 0 is updated to 2008065231.
[1323735557] nsd[15318]: info: Zone ac.in serial 0 is updated to 2008069949.
[1323735557] nsd[15318]: info: Zone xn--fpcrj9c3d serial 0 is updated to 2010100700.
[1323735557] nsd[15318]: info: Zone xn--gecrj9c serial 0 is updated to 2010100700.
[1323735587] nsd[15472]: warning: signal received, reloading...
[root@afilias logs]# strace -p 15481
Process 15481 attached - interrupt to quit


[root@afilias cache]# ps ax|grep nsd.in.conf
15318 ?        S      0:00 /usr/sbin/nsd -c /var/nsd/nsd.in.conf
15472 ?        S      0:05 /usr/sbin/nsd -c /var/nsd/nsd.in.conf
15475 ?        S      0:00 /usr/sbin/nsd -c /var/nsd/nsd.in.conf
15481 ?        R     45:10 /usr/sbin/nsd -c /var/nsd/nsd.in.conf
Comment 1 Wouter Wijngaards 2011-12-13 13:13:08 CET
Hi,

It seems to hang in runnable condition?  If so can you attach gdb to the process and produce a stacktrace (bt command from gdb, attach it via the commandline options for pid and executable file).

Not sure how the SERVFAIL could cause anything out of the normal to happen.

Best regards,
   Wouter
Comment 2 Wouter Wijngaards 2011-12-13 13:20:41 CET
Hi Vicky,

Another question, I am curious why you configure with --enable-root-server?
(you have --with-zonesdir twice, by the way).

Best regards,
   Wouter
Comment 3 Vicky Shrestha 2011-12-15 07:58:15 CET
Hmm, my email reply to bugzilla-daemon@NLnetLabs.nl didn't land my comments here.

Anyways:

I will post the gdb bt , the next time nsd hangs.

As for the -enable-root-server:
We use the same package for all our installations and it is required on one of our instances.
Comment 4 Vicky Shrestha 2011-12-17 11:45:09 CET
This time there are two hung process nsd and nsd-patch. It seems it is going over and over around function label_compare(). 

Sorry about the missing debuginfo, I have build a new package with debuginfo and will run that. Let me know if this is helpful.

# ps ax|grep nsd.info.conf
16587 ?        S      0:13 /usr/sbin/nsd -c nsd.info.conf
25518 ?        S      0:00 /usr/sbin/nsd -c nsd.info.conf
25519 ?        S      0:00 /usr/sbin/nsd -c nsd.info.conf
25592 ?        R    2795:34 /usr/sbin/nsd -c nsd.info.conf
25842 ?        S      0:00 /bin/sh /usr/sbin/nsdc -c /var/nsd/nsd.info.conf patch
25865 ?        R    2777:36 /usr/sbin/nsd-patch -c /var/nsd/nsd.info.conf -x /var/nsd/cache/ixfr.info.db.25842 -s -o /var/nsd/info/nsd.db.25842


# strace -p 16587
Process 16587 attached - interrupt to quit
pselect6(16, [15], [], [], {3543, 210020699}, {NULL, 8}^C <unfinished ...>
Process 16587 detached

# strace -p 25518
Process 25518 attached - interrupt to quit
pselect6(17, [3 14 16], [], [], {27, 919128127}, {NULL, 8}^C <unfinished ...>
Process 25518 detached

# strace -p 25519
Process 25519 attached - interrupt to quit
pselect6(16, [4 5 6 7 8 9 10 11 12 13 15], [], [], NULL, {NULL, 8}^C <unfinished ...>
Process 25519 detached



# strace -p 25592
Process 25592 attached - interrupt to quit
^CProcess 25592 detached


]# strace -p 25865
Process 25865 attached - interrupt to quit
^CProcess 25865 detached


# gdb -p 25592 /usr/sbin/nsd
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/nsd...(no debugging symbols found)...done.
Attaching to program: /usr/sbin/nsd, process 25592
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x0000000000415997 in rbtree_next ()
Missing separate debuginfos, use: debuginfo-install nsd-3.2.9-1.el6.x86_64
(gdb) bt
#0  0x0000000000415997 in rbtree_next ()
#1  0x0000000000408b77 in has_data_below ()
#2  0x0000000000408d78 in rrset_delete ()
#3  0x0000000000408dfe in delete_zone_rrs ()
#4  0x0000000000409d1c in apply_ixfr ()
#5  0x000000000040a415 in read_sure_part ()
#6  0x000000000040aad5 in diff_read_file ()
#7  0x000000000041a0f1 in server_main ()
#8  0x00000000004109b1 in main ()
(gdb) quit
A debugging session is active.

	Inferior 1 [process 25592] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/nsd, process 25592
[root@afilias ~]# gdb -p 25865 /usr/sbin/nsd-patch 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/nsd-patch...(no debugging symbols found)...done.
Attaching to program: /usr/sbin/nsd-patch, process 25865
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x0000000000409cfe in label_compare ()
Missing separate debuginfos, use: debuginfo-install nsd-3.2.9-1.el6.x86_64
(gdb) bt
#0  0x0000000000409cfe in label_compare ()
#1  0x0000000000409ec6 in dname_is_subdomain ()
#2  0x0000000000407305 in has_data_below ()
#3  0x0000000000407518 in rrset_delete ()
#4  0x000000000040759e in delete_zone_rrs ()
#5  0x00000000004084bc in apply_ixfr ()
#6  0x0000000000408bb5 in read_sure_part ()
#7  0x0000000000409275 in diff_read_file ()
#8  0x000000000040c3e1 in main ()
(gdb) quit
A debugging session is active.

	Inferior 1 [process 25865] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/sbin/nsd-patch, process 25865


[root@afilias src]# gdb -p 25592 /usr/sbin/nsd
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/nsd...(no debugging symbols found)...done.
Attaching to program: /usr/sbin/nsd, process 25592
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
0x000000000040b534 in label_compare ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 openssl-1.0.0-10.el6_1.5.x86_64 zlib-1.2.3-26.el6.x86_64
(gdb) bt
#0  0x000000000040b534 in label_compare ()
#1  0x000000000040b726 in dname_is_subdomain ()
#2  0x0000000000408b65 in has_data_below ()
#3  0x0000000000408d78 in rrset_delete ()
#4  0x0000000000408dfe in delete_zone_rrs ()
#5  0x0000000000409d1c in apply_ixfr ()
#6  0x000000000040a415 in read_sure_part ()
#7  0x000000000040aad5 in diff_read_file ()
#8  0x000000000041a0f1 in server_main ()
#9  0x00000000004109b1 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x000000000040b6ce in dname_is_subdomain ()
(gdb) bt
#0  0x000000000040b6ce in dname_is_subdomain ()
#1  0x0000000000408b65 in has_data_below ()
#2  0x0000000000408d78 in rrset_delete ()
#3  0x0000000000408dfe in delete_zone_rrs ()
#4  0x0000000000409d1c in apply_ixfr ()
#5  0x000000000040a415 in read_sure_part ()
#6  0x000000000040aad5 in diff_read_file ()
#7  0x000000000041a0f1 in server_main ()
#8  0x00000000004109b1 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x000000000040b6ce in dname_is_subdomain ()
(gdb) bt
#0  0x000000000040b6ce in dname_is_subdomain ()
#1  0x0000000000408b65 in has_data_below ()
#2  0x0000000000408d78 in rrset_delete ()
#3  0x0000000000408dfe in delete_zone_rrs ()
#4  0x0000000000409d1c in apply_ixfr ()
#5  0x000000000040a415 in read_sure_part ()
#6  0x000000000040aad5 in diff_read_file ()
#7  0x000000000041a0f1 in server_main ()
#8  0x00000000004109b1 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x000000000040b74f in dname_is_subdomain ()
(gdb) bt
#0  0x000000000040b74f in dname_is_subdomain ()
#1  0x0000000000408b65 in has_data_below ()
#2  0x0000000000408d78 in rrset_delete ()
#3  0x0000000000408dfe in delete_zone_rrs ()
#4  0x0000000000409d1c in apply_ixfr ()
#5  0x000000000040a415 in read_sure_part ()
#6  0x000000000040aad5 in diff_read_file ()
#7  0x000000000041a0f1 in server_main ()
#8  0x00000000004109b1 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x000000000040b556 in label_compare ()
(gdb) bt
#0  0x000000000040b556 in label_compare ()
#1  0x000000000040b726 in dname_is_subdomain ()
#2  0x0000000000408b65 in has_data_below ()
#3  0x0000000000408d78 in rrset_delete ()
#4  0x0000000000408dfe in delete_zone_rrs ()
#5  0x0000000000409d1c in apply_ixfr ()
#6  0x000000000040a415 in read_sure_part ()
#7  0x000000000040aad5 in diff_read_file ()
#8  0x000000000041a0f1 in server_main ()
#9  0x00000000004109b1 in main ()
(gdb) 


# gdb -p 25865 /usr/sbin/nsd-patch
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/nsd-patch...(no debugging symbols found)...done.
Attaching to program: /usr/sbin/nsd-patch, process 25865
Reading symbols from /usr/lib64/libcrypto.so.10...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libcrypto.so.10
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x0000000000412417 in rbtree_next ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.25.el6_1.3.x86_64 openssl-1.0.0-10.el6_1.5.x86_64 zlib-1.2.3-26.el6.x86_64
(gdb) bt
#0  0x0000000000412417 in rbtree_next ()
#1  0x0000000000407317 in has_data_below ()
#2  0x0000000000407518 in rrset_delete ()
#3  0x000000000040759e in delete_zone_rrs ()
#4  0x00000000004084bc in apply_ixfr ()
#5  0x0000000000408bb5 in read_sure_part ()
#6  0x0000000000409275 in diff_read_file ()
#7  0x000000000040c3e1 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0000000000409eb9 in dname_is_subdomain ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0000000000409d03 in label_compare ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0000000000409eef in dname_is_subdomain ()
(gdb) bt
#0  0x0000000000409eef in dname_is_subdomain ()
#1  0x0000000000407305 in has_data_below ()
#2  0x0000000000407518 in rrset_delete ()
#3  0x000000000040759e in delete_zone_rrs ()
#4  0x00000000004084bc in apply_ixfr ()
#5  0x0000000000408bb5 in read_sure_part ()
#6  0x0000000000409275 in diff_read_file ()
#7  0x000000000040c3e1 in main ()
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x0000000000409eef in dname_is_subdomain ()
(gdb) bt
#0  0x0000000000409eef in dname_is_subdomain ()
#1  0x0000000000407305 in has_data_below ()
#2  0x0000000000407518 in rrset_delete ()
#3  0x000000000040759e in delete_zone_rrs ()
#4  0x00000000004084bc in apply_ixfr ()
#5  0x0000000000408bb5 in read_sure_part ()
#6  0x0000000000409275 in diff_read_file ()
#7  0x000000000040c3e1 in main ()
Comment 5 Willem Toorop 2011-12-22 16:58:07 CET
> Program received signal SIGINT, Interrupt.
> 0x0000000000409eef in dname_is_subdomain ()
> (gdb) bt
> #0  0x0000000000409eef in dname_is_subdomain ()
> #1  0x0000000000407305 in has_data_below ()
> #2  0x0000000000407518 in rrset_delete ()
> #3  0x000000000040759e in delete_zone_rrs ()
> #4  0x00000000004084bc in apply_ixfr ()
> #5  0x0000000000408bb5 in read_sure_part ()
> #6  0x0000000000409275 in diff_read_file ()
> #7  0x000000000040c3e1 in main ()

Yes... it appears to hang deleting the zone content (with delete_zone_rrs) before applying a AXFR.
If you have a situation like this, and you copy the nsd.db and ixfr.db and ixfrd.state files, you can probably reproduce the error in isolation.
Comment 6 Matthijs Mekking 2012-01-18 16:35:24 CET
Hi Vicky,

We have encountered the same behavior. Actually, NSD does not hang, but it has become very slow when handling a zone transfer. In short, this is because during the XFR process, NSD tries to fix the existence of empty terminals. The fix is to this after the XFR process is complete.

The fix is in branches/NSD_3_2, r3504.

I'll prepare a release. In the meantime, you can test the branch if you want. If you prefer, I can make a snapshot release instead (you can contact me offline).