Bug 736 - Segfault during zone transfer
Status: RESOLVED FIXED
Product: NSD
Classification: Unclassified
Component: NSD Code
Version: 4.1.x
Hardware: i386 Linux
Importance: P5 major
Assigned To: NSD team
Reported: 2016-01-17 16:50 CET by Kim Alvefur
Modified: 2016-01-21 00:29 CET

Attachments
strace (4.09 KB, application/octet-stream)
2016-01-17 16:50 CET, Kim Alvefur
config (1.20 KB, application/octet-stream)
2016-01-17 16:52 CET, Kim Alvefur
counting 47 segfaults (4.50 KB, text/x-log)
2016-01-18 15:49 CET, Kim Alvefur
traceback, gdb: bt full (4.04 KB, text/plain)
2016-01-18 17:09 CET, Kim Alvefur

Description Kim Alvefur 2016-01-17 16:50:09 CET
Created attachment 314 [details]
strace

Hi

While attempting to correct a zone transfer ACL issue, my NSD 4.1.7 started segfaulting while doing AXFR:

Jan 17 16:29:09 sphyrna nsd[22730]: axfr for zash.se. from 2a00:66c0:1:1:5652:ff:fe79:f3c6
Jan 17 16:29:09 sphyrna nsd[19464]: server 22730 died unexpectedly, restarting
Jan 17 16:29:09 sphyrna nsd[19464]: process 22730 terminated with status 11
Comment 1 Kim Alvefur 2016-01-17 16:52:00 CET
Created attachment 315 [details]
config
Comment 2 Wouter Wijngaards 2016-01-18 13:56:19 CET
Hi Kim,

Can you make a stack trace? The crash seems to be triggered by the last zone transfer, for zash.se. Could you run `gdb --args nsd -d ...otheroptions` and then make a stack backtrace with `bt`? In gdb, the command `set follow-fork-mode child` makes gdb follow the child process, which is the one that seems to have died.

(You can also use `set detach-on-fork no` and then navigate the set of processes that gdb is following, but that is more work.)

Best regards, Wouter
Comment 3 Kim Alvefur 2016-01-18 15:45:33 CET
Sorry, I can't seem to reproduce today.

While experimenting I set

provide-xfr: ::/0 NOKEY 
provide-xfr: 0.0.0.0/0 NOKEY

since I was having issues with ACLs at the same time.

After that, the other NSes managed to get updated (which might be a coincidence). Before that, they were out of date by about a week.
Comment 4 Kim Alvefur 2016-01-18 15:49:34 CET
Created attachment 316 [details]
counting 47 segfaults
Comment 5 Wouter Wijngaards 2016-01-18 15:53:11 CET
Hi Kim,

That dmesg list suggests the problem is repeatable. I'd like to be able to reproduce it, or to get a stack trace of it (the error message in dmesg does not tell me a lot). With a core dump you can make a stack trace afterwards with `gdb <executable name> <coredumpfile>` and then the `bt` command. I'm not sure how you enable core dumps on your platform, though.

Best regards, Wouter
Comment 6 Kim Alvefur 2016-01-18 16:20:00 CET
Yes, it happened every time I ran `nsd-control notify`, but as I said I'm unable to reproduce today and I have no clue why.

I have attempted to enable core dumps by adding `ulimit -c unlimited` to the startup script.
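
For reference, a process can also raise its own core-file limit instead of relying on the startup script. The following is only a sketch using the POSIX getrlimit()/setrlimit() calls: it raises the soft RLIMIT_CORE to the hard limit, which is as far as an unprivileged process can go. enable_core_dumps() is a hypothetical helper, not something NSD provides, and whether a core file is actually written still depends on platform settings outside the process.

```c
#include <sys/resource.h>

/* Hypothetical helper (not part of NSD): raise the soft core-file size
 * limit to the hard limit for the calling process.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
static int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    /* A process may always raise its soft limit up to the hard limit. */
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim);
}
```

If the hard limit is itself 0, this changes nothing and the limit has to be raised by whatever starts the daemon.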

Looking up the 'ip' (instruction pointer) address from the crash log lines in a core file from a process that did not crash:

(gdb) x 0x0806533c
0x806533c <query_axfr.part.5.2374+652>:	0x8510468b
Comment 7 Wouter Wijngaards 2016-01-18 16:24:08 CET
Hi Kim,

Thank you, I did not know about the gdb x command. That narrows it down a little, but I still do not really have a clue what is going wrong. I tried to reproduce it by copying your ACL lines, but that did not cause the segv. Perhaps the contents of your zone file were the issue (was there something in it that caused the problem?).

Best regards, Wouter
Comment 8 Kim Alvefur 2016-01-18 17:09:33 CET
Created attachment 317 [details]
traceback, gdb: bt full

Ha!

This might be the guilty record:

_acme-challenge	TXT	"ukaIbkXPboRuJip1JFzc5uPHGgcjiH008h8gWRvBsM0.TQQ9zddid1u-FkpGMjDRVOzSqPiMHmSf_dflOsqg4co"

(I was experimenting with ACME when I found the AXFR problem)

traceback attached
Comment 9 Paul Aurich 2016-01-18 19:53:51 CET
I was able to reproduce this with Zash's zone file (though I'll leave it up to him whether to provide it publicly) and `dig` performing a zone transfer. I also couldn't reproduce it if I removed any single entry from the zone file, so it seems to be size-related.

    % ls -l zash.se       
    -rw-r--r-- 1 paul paul 106070 Jan 18 09:59 zash.se

If I compile with --enable-checking, then it's tripping this assertion:
    nsd: axfr.c:98: query_axfr: Assertion `query->axfr_current_domain' failed.

This happens on the fifth time through query_axfr().

From gdb (capturing a stack trace):

Breakpoint 1, query_axfr (nsd=0x6deca0 <nsd>, query=0x73f0b0) at axfr.c:98
98              assert(query->axfr_current_domain);
#0  query_axfr (nsd=0x6deca0 <nsd>, query=0x73f0b0) at axfr.c:98
#1  0x000000000040534c in answer_axfr_ixfr (nsd=0x6deca0 <nsd>, q=0x73f0b0) at axfr.c:203
#2  0x000000000041e7a1 in query_process (q=0x73f0b0, nsd=0x6deca0 <nsd>) at query.c:1436
#3  0x000000000045d9a8 in server_process_query (nsd=0x6deca0 <nsd>, query=0x73f0b0) at server.c:1882
#4  0x000000000045ebb0 in handle_tcp_reading (fd=14, event=2, arg=0x73e090) at server.c:2554
#5  0x00007ffff74d7179 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#6  0x000000000045e04d in server_child (nsd=0x6deca0 <nsd>) at server.c:2087
#7  0x000000000045a0bb in restart_child_servers (nsd=0x6deca0 <nsd>, region=0x716940, netio=0x723e50, xfrd_sock_p=0x6e0488) at server.c:324
#8  0x000000000045b2ed in server_start_children (nsd=0x6deca0 <nsd>, region=0x716940, netio=0x723e50, xfrd_sock_p=0x6e0488) at server.c:971
#9  0x000000000045cc2a in server_main (nsd=0x6deca0 <nsd>) at server.c:1568
#10 0x0000000000459544 in main (argc=0, argv=0x7fffffffe178) at nsd.c:1143

... (four more entries elided here that all looked like this last one, but didn't trip the assert) ...

Breakpoint 1, query_axfr (nsd=0x6deca0 <nsd>, query=0x73f0b0) at axfr.c:98
98              assert(query->axfr_current_domain);
#0  query_axfr (nsd=0x6deca0 <nsd>, query=0x73f0b0) at axfr.c:98
#1  0x000000000045f125 in handle_tcp_writing (fd=14, event=4, arg=0x73e090) at server.c:2715
#2  0x00007ffff74d7179 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.5
#3  0x000000000045e04d in server_child (nsd=0x6deca0 <nsd>) at server.c:2087
#4  0x000000000045a0bb in restart_child_servers (nsd=0x6deca0 <nsd>, region=0x716940, netio=0x723e50, xfrd_sock_p=0x6e0488) at server.c:324
#5  0x000000000045b2ed in server_start_children (nsd=0x6deca0 <nsd>, region=0x716940, netio=0x723e50, xfrd_sock_p=0x6e0488) at server.c:971
#6  0x000000000045cc2a in server_main (nsd=0x6deca0 <nsd>) at server.c:1568
#7  0x0000000000459544 in main (argc=0, argv=0x7fffffffe178) at nsd.c:1143

nsd: axfr.c:98: query_axfr: Assertion `query->axfr_current_domain' failed.
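
A note on why a regular build might segfault where this build aborts with an assertion: in standard C, assert() expands to nothing when NDEBUG is defined, so the check at axfr.c:98 would disappear and the code would go on to use the NULL pointer. That --enable-checking is what controls NDEBUG in NSD's build is an assumption here, not something verified in this report. A minimal standalone sketch of the standard behaviour (assert_evaluates_argument() is a made-up name):

```c
/* Assumption: a build without runtime checks behaves as if NDEBUG
 * were defined before <assert.h> is included. */
#define NDEBUG
#include <assert.h>

/* Returns 1 if assert() evaluated its argument, 0 if the assert was
 * compiled out. With NDEBUG defined above, the argument is never
 * evaluated, so this returns 0. */
static int assert_evaluates_argument(void)
{
    int evaluated = 0;

    assert((evaluated = 1));
    return evaluated;
}

/* Restore a working assert() for any code that follows; <assert.h>
 * redefines the macro each time it is included. */
#undef NDEBUG
#include <assert.h>
```

So the assertion only documents the precondition; it does not protect a build in which it has been compiled out.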
Comment 10 Wouter Wijngaards 2016-01-19 09:10:36 CET
Hi Kim,

Thank you for the stack trace! It is size-related (not that exact TXT record, but its size). This is the patch, I think.

Does this fix the problem for you?  Regardless, I'll do this fix and attempt to construct an (artificial) test case.

Best regards, Wouter

Index: axfr.c
===================================================================
--- axfr.c	(revision 4584)
+++ axfr.c	(working copy)
@@ -95,9 +95,10 @@
 	}
 
 	/* Add zone RRs until answer is full.  */
-	assert(query->axfr_current_domain);
-
-	do {
+	while (query->axfr_current_domain != NULL &&
+			domain_is_subdomain(query->axfr_current_domain,
+					    query->axfr_zone->apex))
+	{
 		if (!query->axfr_current_rrset) {
 			query->axfr_current_rrset = domain_find_any_rrset(
 				query->axfr_current_domain,
@@ -128,9 +129,6 @@
 		query->axfr_current_domain
 			= domain_next(query->axfr_current_domain);
 	}
-	while (query->axfr_current_domain != NULL &&
-			domain_is_subdomain(query->axfr_current_domain,
-					    query->axfr_zone->apex));
 
 	/* Add terminating SOA RR.  */
 	assert(query->axfr_zone->soa_rrset->rr_count == 1);
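
To illustrate the shape of this change outside NSD, here is a toy model, not NSD code. struct toy_domain, count_do_while() and count_while() are hypothetical names; in_zone stands in for domain_is_subdomain() and next for domain_next(). The sketch assumes, going by the backtraces above, that the function can be re-entered with a cursor that is already NULL, presumably after a packet filled up just as the last domain was consumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for NSD's domain list; not the real types. */
struct toy_domain {
    int in_zone;                /* stands in for domain_is_subdomain() */
    struct toy_domain *next;    /* stands in for domain_next() */
};

/* Pre-patch shape: the body runs once before the condition is tested,
 * so the caller must guarantee a non-NULL cursor. If the assert is
 * compiled out, a NULL cursor is dereferenced at cur->next. */
static int count_do_while(const struct toy_domain *cur)
{
    int visited = 0;

    assert(cur != NULL);
    do {
        visited++;
        cur = cur->next;
    } while (cur != NULL && cur->in_zone);
    return visited;
}

/* Post-patch shape: the condition is tested first, so a NULL or
 * out-of-zone cursor simply skips the loop body. */
static int count_while(const struct toy_domain *cur)
{
    int visited = 0;

    while (cur != NULL && cur->in_zone) {
        visited++;
        cur = cur->next;
    }
    return visited;
}
```

With a cursor that starts inside the zone both loops visit the same domains. They differ only when the starting cursor is NULL or already outside the zone: count_while() visits nothing, while count_do_while() either trips the assert or processes one domain it should not.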
Comment 11 Kim Alvefur 2016-01-20 16:00:27 CET
Patch applied and recompiled, has not crashed since :)
Comment 12 Wouter Wijngaards 2016-01-20 16:55:42 CET
Hi Kim,

Thanks for the test!

Best regards, Wouter
Comment 13 Kim Alvefur 2016-01-21 00:29:33 CET
Thanks Wouter for the guidance and thanks to Paul for helping debug this :)