Bug 1172

Summary: Reload process %d failed
Product: NSD
Component: NSD Code
Status: ASSIGNED
Severity: major
Priority: P5
Version: 4.1.x
Hardware: Other
OS: All
Reporter: Jared Mauch <jared>
Assignee: NSD team <nsd-team>
CC: wouter

Description Jared Mauch 2016-12-02 17:06:58 CET
When using nsd-control reload, failures appear to occur that stop any further updates to the in-memory zones, with no error logging to diagnose what happened (e.g. the reason for the failure). As a result, some zones have not updated in several days.

Example logs:

Nov 20 15:22:20 puck nsd[437]: process 595 exited with status 6
Nov 20 15:22:20 puck nsd[32028]: handle_reload_cmd: reload closed cmd channel
Nov 20 15:22:20 puck nsd[32028]: Reload process 595 failed, continuing with old database
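To check how often this happens across a longer log, the lines above can be correlated mechanically. The following is a minimal sketch, assuming the log format is exactly as quoted here; the function name `summarize_reload_failures` and the regular expressions are illustrative and are not part of NSD.

```python
import re

# Patterns derived from the log lines quoted in this report (assumed format).
EXITED = re.compile(r"nsd\[(\d+)\]: process (\d+) exited with status (\d+)")
FAILED = re.compile(r"nsd\[(\d+)\]: Reload process (\d+) failed")


def summarize_reload_failures(lines):
    """Pair each 'Reload process N failed' line with the exit status logged for N.

    Returns a list of (pid, status) tuples; status is None when no matching
    'process N exited with status S' line was seen.
    """
    exit_status = {}
    failures = []
    for line in lines:
        m = EXITED.search(line)
        if m:
            exit_status[int(m.group(2))] = int(m.group(3))
            continue
        m = FAILED.search(line)
        if m:
            failures.append(int(m.group(2)))
    return [(pid, exit_status.get(pid)) for pid in failures]


sample = [
    "Nov 20 15:22:20 puck nsd[437]: process 595 exited with status 6",
    "Nov 20 15:22:20 puck nsd[32028]: handle_reload_cmd: reload closed cmd channel",
    "Nov 20 15:22:20 puck nsd[32028]: Reload process 595 failed, continuing with old database",
]
print(summarize_reload_failures(sample))  # [(595, 6)]
```

A failure reported with status None means the log contains no exit line for that process number, which is worth noting when reporting the problem.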
Comment 1 Wouter Wijngaards 2016-12-02 17:17:01 CET
Hi Jared,

This means that a zone transfer (IXFR or AXFR) that has just been fetched causes NSD to segfault in the reload process. I would like to fix that segfault; can you give me more debug information?

Can you give the zone update that causes the problem?

I would like more (log) information about process 595; it was just forked by NSD to process the update. You can increase verbosity with -V 5 at startup, or increase logging even further with -F 60 -L 2 (that prints XFRD and IPC debug information). That should print the calls that fork that process (the IPC information), and the XFRD logs may also be of use. Hopefully that pinpoints which zone is causing the problem; I would then also like the contents that cause it, so I can replicate the failure.

Or, if you could, a stack trace of process 595 (I mean the one that failed with status 6) is exactly what I am looking for. If you make a coredump, you should also make the stack trace yourself, because gdb needs your machine to work out the stack trace.
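One way to produce such a stack trace from a core file is to run gdb in batch mode on the machine where the crash happened. This is only a sketch: the helper names are made up for illustration, and the binary and core paths in the usage comment are assumptions that depend on the local install and core-dump settings.

```python
import subprocess


def gdb_backtrace_command(binary, core):
    """Build a gdb batch-mode command that prints a backtrace of every thread."""
    return ["gdb", "-batch", "-ex", "thread apply all bt full", binary, core]


def collect_backtrace(binary, core):
    """Run gdb on the given binary and core file and return its text output."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero; keep whatever output it produced
    )
    return result.stdout + result.stderr


# Hypothetical usage; both paths are assumptions, adjust them to your system:
# print(collect_backtrace("/usr/sbin/nsd", "/var/tmp/nsd.core"))
```

The output can then be attached to the bug report as plain text.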

Best regards, Wouter
Comment 2 Jared Mauch 2016-12-02 20:01:05 CET
After looking closer, I don't think the process number is what you think it is, because I have entries like this in my log:

Nov 22 19:23:06 puck nsd[20489]: handle_reload_cmd: reload closed cmd channel
Nov 22 19:23:06 puck nsd[20489]: Reload process 1 failed, continuing with old database

I've set verbose to 5 and built the 14-rc1 version and am running it.

I'm not able to see some of the other problems right now, e.g. stats_noreset taking ~20 seconds to respond.
Comment 3 Wouter Wijngaards 2016-12-05 14:55:09 CET
Hi Jared,

Process 1? That sounds like a problem in the code, so that it prints the wrong number. Isn't process 1 the idle process, or a system process?

If you still have the reload process failed log entries, can you increase verbosity (e.g. compile with --enable-debug and run with -F 60 -L 2) so that it is possible to see what went wrong just before it crashes?

Best regards, Wouter