Bug 1293

Summary: NSD 4.1.15 chokes on some zone transfers
Product: NSD Reporter: Anand Buddhdev <anandb>
Component: NSD CodeAssignee: NSD team <nsd-team>
Status: ASSIGNED ---    
Severity: major CC: wouter
Priority: P5    
Version: 4.1.x   
Hardware: x86_64   
OS: Linux   

Description Anand Buddhdev 2017-06-15 00:00:12 CEST
Two days, all three of our NSD 4.1.15 servers began, failing to update zones, and crashing, with segfault messages in the system log. It all starts with the following log entry, in each NSD server's log:

server1:

[2017-06-12 10:14:13.673] nsd[46574]: info: notify for 21.141.in-addr.arpa. from 93.175.159.250 serial 2005133427
[2017-06-12 10:14:13.879] nsd[5177]: info: xfrd: zone 21.141.in-addr.arpa. committed "received update to serial 2005133427 at 2017-06-12T10:14:13 from 93.175.15
9.250 TSIG verified with key main.ripe.net"
[2017-06-12 10:14:18.471] nsd[5177]: error: xfrd: zone 21.141.in-addr.arpa.: soa serial 2005133427 update failed, restarting transfer (notified zone)

server2:

[2017-06-12 10:14:13.670] nsd[13085]: info: notify for 21.141.in-addr.arpa. from 93.175.159.250 serial 2005133427
[2017-06-12 10:14:17.418] nsd[4934]: info: xfrd: zone 21.141.in-addr.arpa. committed "received update to serial 2005133427 at 2017-06-12T10:14:17 from 93.175.15
9.250 TSIG verified with key main.ripe.net"
[2017-06-12 10:14:24.223] nsd[4934]: error: xfrd: zone 21.141.in-addr.arpa.: soa serial 2005133427 update failed, restarting transfer (notified zone)

server3:

[2017-06-12 10:14:13.658] nsd[1724]: info: notify for 21.141.in-addr.arpa. from 93.175.159.250 serial 2005133427
[2017-06-12 10:14:17.637] nsd[4910]: info: xfrd: zone 21.141.in-addr.arpa. committed "received update to serial 2005133427 at 2017-06-12T10:14:17 from 93.175.15
9.250 TSIG verified with key main.ripe.net"
[2017-06-12 10:14:24.029] nsd[4910]: error: xfrd: zone 21.141.in-addr.arpa.: soa serial 2005133427 update failed, restarting transfer (notified zone)

After this first failure to update the 21.141.in-addr.arpa zone, NSD logs similar failures for many OTHER zones on each server.

Some hours later, NSD processes begin crashing, and the dmesg log shows segfaults.

server1:

# dmesg -e|grep fault
[Jun12 15:22] nsd[5462]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[  +1.097402] nsd[71641]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[Jun12 15:23] nsd[71649]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[ +38.792784] nsd[72668]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[Jun12 15:51] nsd[83898]: segfault at 8 ip 0000000000417dd3 sp 00007fffc1011168 error 4 in nsd[400000+68000]
[Jun12 15:58] nsd[87107]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[  +0.708763] nsd[89842]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[Jun12 16:04] nsd[89843]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[  +0.733247] nsd[92243]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[Jun12 16:12] nsd[92272]: segfault at 8 ip 0000000000417dd3 sp 00007fffc1011168 error 4 in nsd[400000+68000]
[Jun12 16:42] nsd[97046]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[  +0.722844] nsd[111239]: segfault at 8 ip 0000000000417dd3 sp 00007fffc1011168 error 4 in nsd[400000+68000]
[Jun12 16:52] nsd[111240]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[  +0.748578] nsd[117619]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[Jun12 17:04] nsd[117627]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[  +0.670130] nsd[122576]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[ +19.360878] nsd[122579]: segfault at 8 ip 0000000000417dd3 sp 00007fffc1011168 error 4 in nsd[400000+68000]
[  +2.344533] nsd[137070]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[  +0.738339] nsd[144803]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[Jun12 18:02] nsd[144831]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]

server2:

# dmesg -e|grep fault
[ +33.276327] nsd[5207]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[Jun12 15:04] nsd[29132]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[ +40.834929] nsd[29133]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[  +0.936703] nsd[29374]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[Jun12 15:26] nsd[29375]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[Jun12 15:40] nsd[7029]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[  +0.940670] nsd[11804]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[  +7.260338] nsd[11807]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[  +0.970657] nsd[16431]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[  +2.194095] nsd[16432]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[Jun12 16:34] nsd[32423]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[  +1.054949] nsd[2945]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[Jun12 16:44] nsd[2953]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[  +1.040242] nsd[6642]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[ +14.294976] nsd[6652]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[Jun12 17:27] nsd[19516]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[  +0.921562] nsd[26277]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]
[  +2.071890] nsd[26285]: segfault at 18 ip 00000000004175b5 sp 00007ffededf5490 error 4 in nsd[400000+68000]

server3:

# dmesg -e|grep fault
[Jun12 15:04] nsd[5206]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[  +0.910901] nsd[31733]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[Jun12 15:05] nsd[31735]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[  +1.008571] nsd[31975]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[Jun12 15:28] nsd[31981]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[Jun12 15:35] nsd[13923]: segfault at 8 ip 0000000000417dd3 sp 00007ffeeea67fb8 error 4 in nsd[400000+68000]
[Jun12 15:40] nsd[16927]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[  +0.922254] nsd[20213]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[Jun12 15:47] nsd[20283]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[  +0.987350] nsd[25369]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[Jun12 15:56] nsd[25370]: segfault at 8 ip 0000000000417dd3 sp 00007ffeeea67fb8 error 4 in nsd[400000+68000]
[Jun12 16:24] nsd[29814]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[Jun12 16:34] nsd[14004]: segfault at 8 ip 0000000000417dd3 sp 00007ffeeea67fb8 error 4 in nsd[400000+68000]
[  +1.520211] nsd[19381]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[  +3.632521] nsd[19382]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[  +0.943951] nsd[27736]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[Jun12 17:14] nsd[27737]: segfault at 8 ip 0000000000417dd3 sp 00007ffeeea67fb8 error 4 in nsd[400000+68000]
[Jun12 17:28] nsd[9522]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[  +0.960532] nsd[19158]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]
[Jun12 17:45] nsd[19164]: segfault at 18 ip 00000000004175b5 sp 00007ffeeea67f30 error 4 in nsd[400000+68000]

These segfaults correspond to lines like these in the NSD logs:

[2017-06-12 15:18:31.897] nsd[71641]: error: handle_reload_cmd: reload closed cmd channel
[2017-06-12 15:18:31.897] nsd[71641]: warning: Reload process 5462 failed, continuing with old database
[2017-06-12 15:18:31.898] nsd[5177]: error: process 5462 exited with status 11

Then, our monitoring system alerted us that several zones were starting to lag behind, and serving old serials. Eventually, my colleague logged into each server, stopped NSD fully, and started it up. After that it was fine.

Sine this happened at the same time on 3 independent servers, and at the same time, it seems to point to some kind of bug being triggered in NSD, probably caused by an incoming zone transfer that NSD failed to handle properly. Perhaps it confused NSD, causing it to start failing other zone transfers too.

I don't have much more evidence other than the logs. If there's anything else you'd like to see, please let me know ASAP, so I can provide it if it's still available.
Comment 1 Wouter Wijngaards 2017-06-15 09:47:12 CEST
Hi Anand,

Can you provide the zone transfer file from /tmp for that.  It would likely still be in /tmp/nsd.../...file.  That gives the zone transfer itself so I can see what happened inside it.

It probably queued up this work item, and every other zone transfer kicked off a process where this work item crashes the reload command.

There is a command that translates the 
[ +38.792784] nsd[72668]: segfault at 18 ip 00000000004175b5 sp 00007fffc10110e0 error 4 in nsd[400000+68000]
[Jun12 15:51] nsd[83898]: segfault at 8 ip 0000000000417dd3 sp 00007fffc1011168 error 4 in nsd[400000+68000]
lines into a readable version, but it has to run on your machine (your version).  It is called addr2line, can you run that for me (on a non-stripped version of the  executable)?

Can I have the 'AXFR' for that zone (i.e. the full zone), perhaps it contains weird stuff?  Can you download the IXFR (from one version ago) with dig -t ixfr=N (previous serial number) @server <zonename> ?   2005133426 I guess.  So I can see if there is weird stuff in that IXFR.

Best regards, Wouter
Comment 2 Anand Buddhdev 2017-06-15 11:57:00 CEST
Hi Wouter. First, the addr2line output:

# addr2line -e /usr/lib/debug/usr/sbin/nsd.debug 00000000004175b5                                                                                 
/usr/src/debug/nsd-4.1.15/rbtree.c:196

Next, there are 183 unprocessed zone transfers. The first one in the directory is not for the zone I reported, so perhaps it was another zone that caused the problem. I've made a tarball of all the failed xfrs, but it is 318 MB in size. I will make it available to you via a private channel.

The IXFR of the zone that first exhibited the error is attached, but it looks rather ordinary.

However, I'm starting to suspect that the problem may have been caused by another zone transfer, related to .SY and its subdomains, which we also slave on this server, because I've also found this in the logs for the same time period:

[2017-06-12 15:18:32.802] nsd[29075]: info: notify for edu.sy. from 193.0.19.190 serial 2017061231
[2017-06-12 15:18:32.959] nsd[29119]: info: notify for edu.sy. from 93.175.159.250 serial 2017061231
[2017-06-12 15:18:33.067] nsd[4934]: info: xfrd: zone edu.sy. committed "received update to serial 2017061231 at 2017-06-12T15:18:33 from 93.175.159.250 TSIG ve
rified with key main.ripe.net"
[2017-06-12 15:18:33.506] nsd[29133]: error: handle_reload_cmd: reload closed cmd channel
[2017-06-12 15:18:33.506] nsd[29133]: warning: Reload process 29132 failed, continuing with old database
[2017-06-12 15:18:33.507] nsd[4934]: error: xfrd: zone sy.: soa serial 2017061231 update failed, restarting transfer (notified zone)
[2017-06-12 15:18:33.507] nsd[4934]: error: xfrd: zone gov.sy.: soa serial 2017061230 update failed, restarting transfer (notified zone)
[2017-06-12 15:18:33.730] nsd[29075]: info: notify for name.sy. from 193.0.19.190 serial 2017061230
[2017-06-12 15:18:33.854] nsd[29075]: info: notify for name.sy. from 93.175.159.250 serial 2017061230
[2017-06-12 15:18:33.858] nsd[4934]: info: xfrd: zone name.sy. committed "received update to serial 2017061230 at 2017-06-12T15:18:33 from 193.0.19.190 TSIG ver
ified with key main.ripe.net"
[2017-06-12 15:18:33.861] nsd[29133]: error: edu.sy. NSEC3PARAM entry has no hash(apex).
[2017-06-12 15:18:33.861] nsd[29133]: error: hash(apex)= 2ffgr073s7scl3mco0ruipnsq948o7at.edu.sy.
[2017-06-12 15:18:33.861] nsd[29133]: info: zone edu.sy. received update to serial 2017061231 at 2017-06-12T15:18:33 from 93.175.159.250 TSIG verified with key main.ripe.net of 32697 bytes in 0.000488 seconds
Comment 3 Wouter Wijngaards 2017-06-15 11:57:52 CEST
Hi,

Can you also do addr2line -e /usr/lib/debug/usr/sbin/nsd.debug 0000000000417dd3 ?

Best regards, Wouter
Comment 4 Wouter Wijngaards 2017-06-15 11:59:39 CEST
Hi Anand,

Yes I also think that edu.sy looks like it is the (earliest?) problem.

And then I would be interested in that zone(file) and it zone transfer IXFR...

Best regards, Wouter
Comment 5 Wouter Wijngaards 2017-06-15 12:14:24 CEST
Hi Anand,

Yes in your transfer the first couple of files are all .sy domains,
xfr.578541, xfr.578542, xfr.578543, xfr.578546, xfr.578550, xfr.578551,

starting with info.sy, .sy, gov.sy.  So likely .sy or one of its subtlds.

Best regards, Wouter
Comment 6 Anand Buddhdev 2017-06-15 12:16:40 CEST
# addr2line -e /usr/lib/debug/usr/sbin/nsd.debug 0000000000417dd3
/usr/src/debug/nsd-4.1.15/rbtree.c:500 (discriminator 1)
Comment 7 Wouter Wijngaards 2017-06-15 12:24:19 CEST
Hi Anand,

So I'm going to inspect the XFRs for .sy (and <tld>.sy), but that takes some time; it is likely that I would want the full zone data for them (they are IXFRs) to compare against.  If you could set those zones (or all zones) aside for me, so I can get a (close) serial number (if I need to ask later for that)?

Best regards, Wouter
Comment 8 Anand Buddhdev 2017-06-15 12:27:40 CEST
I've AXFRed all the .SY zones and kept them aside, in case you need them.
Comment 9 Anand Buddhdev 2017-06-19 12:37:36 CEST
Hi Wouter. Have you found anything in the xfr files? Do you need any more data from me?
Comment 10 Wouter Wijngaards 2017-06-19 13:37:39 CEST
Hi Anand,

Got distracted by Unbound security related issue.  But I did investigate, but haven't been able to find or reproduce the bug.

The .sy domains are undergoing NSEC3 resalt.  Changing salt.  They send an IXFR.  They send the new NSEC3PARAM record, but they do not remove the old NSEC3PARAM record (likely invalidating the RRSIG over that record).  This causes NSD to complain about the bad NSEC3 parameters.  I can reproduce it that far.  I then understand that something is going on with NSEC3, and I guess some NULL pointer in the rbtree happens; but I cannot reproduce that part.  NSD works fine when I resalt NSEC3.

Now I can fixup the NSEC3PARAM selection for their odd not-deleting-the-old-one behaviour, but I really want to find the NULL ptr exception.

So I would appreciate the <tld>.sy (and perhaps .sy itself) zone files.  I want to see if there is something odd in there - that would not be ordinary delegations - but perhaps oddness in the NSEC3 chain.  I doubt I'd be able to replay that IXFR and somehow reproduce that bug by doing that.  But I seem to have no other option on how to find out more, and what I reproduce so far, NSEC3 resalt and wrong NSEC3PARAM does not fail.

Specifically edu.sy because that is what NSD complains about and then crashes. 

Best regards, Wouter
Comment 11 Anand Buddhdev 2017-06-19 13:47:21 CEST
I have the .SY zone files, as well as the BIND journals, so I will make those available to you. Perhaps you can set up a BIND master and make NSD request IXFR from it using the journals I provide. Check your email in a few minutes for a link to these files.
Comment 12 Wouter Wijngaards 2017-06-19 13:49:32 CEST
Hi Anand,

Just edu.sy because that seems to be the one it is crashing on.

Best regards, Wouter
Comment 13 Wouter Wijngaards 2017-06-21 15:48:05 CEST
Hi Anand,

I cannot figure out what happened exactly, but I think it may be this patch.  I am still trying to reproduce the issue with the (ample) data you provided.  If it happens again (reproducibly) I would appreciate a (higher) verbosity (or even -F 60 -L 2) log of the actions just before where it says 'reload closed cmd channel'.

I include the patch in case it happens again and you want to be protected.

Best regards, Wouter

Index: namedb.c
===================================================================
--- namedb.c	(revision 4756)
+++ namedb.c	(working copy)
@@ -305,6 +305,7 @@
 {
 	if(!*tree)
 		*tree = rbtree_create(region, cmpf);
+	if(node->key) return;
 	memset(node, 0, sizeof(rbnode_type));
 	node->key = domain;
 	rbtree_insert(*tree, node);
Comment 14 Anand Buddhdev 2017-06-21 16:14:40 CEST
Hi Wouter. Normally, we do not run the server with "-F 60 -L 2". If the issue happens again, the only way to get a debug log like that will involve restarting the NSD server, and that actually fixes the issue. We can't really run our production servers with such high verbosity all the time, so I don't know how to give you detailed logs when/if this issues occurs again.
Comment 15 Anand Buddhdev 2017-06-23 12:02:05 CEST
Hi Wouter. I've been looking at the logs and data again, and I'm not certain that the problem is with edu.sy. The first time that a reload fails is when NSD attempts to update info.sy. This is also the very first XFR in directory of unconsumed XFRs. Let me send you the zone file and journal of this zone. Perhaps you can see something in it, or reproduce the issue.
Comment 16 Wouter Wijngaards 2017-06-23 12:03:39 CEST
Hi Anand,

Yes, certainly.

Best regards, Wouter
Comment 17 Anand Buddhdev 2017-06-23 12:13:16 CEST
Oh, I just realised that I don't need to send you info.sy. It should already be in the tarball I made available earlier, along with its BIND journal, which you can display with "named-journalprint".