Bug 288 - nsd hangs (defunct) on third reload
nsd hangs (defunct) on third reload
Status: RESOLVED FIXED
Product: NSD
Classification: Unclassified
Component: NSD Code
3.2.x
i386 Linux
: P2 major
Assigned To: NSD team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-12-07 11:17 CET by Alexey Ivanov
Modified: 2010-02-17 09:30 CET (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alexey Ivanov 2009-12-07 11:17:29 CET
I'm using Chef for automated config deployment.
After third(!!) reload all nsd childs become zombies

#ps -afx
13520 ?        S      0:00 /usr/sbin/nsd
13588 ?        S      0:00  \_ /usr/sbin/nsd
13589 ?        Z      0:00      \_ [nsd] <defunct>
13590 ?        Z      0:00      \_ [nsd] <defunct>
13591 ?        Z      0:00      \_ [nsd] <defunct>
13592 ?        Z      0:00      \_ [nsd] <defunct>

After all next reloads it just says:
nsd[13520]: warning: Reload already in progress (pid = 13588)
nsd[13520]: warning: Reload already in progress (pid = 13588)
nsd[13520]: warning: Reload already in progress (pid = 13588)

Only restart helps.

init script (reload section):

    reload)
        ## Like force-reload, but if daemon does not support
        ## signaling, do nothing (!)

        # If it supports signaling:
        echo -n "Reload service nsd "
        /usr/sbin/nsdc rebuild >/dev/null && /usr/sbin/nsdc notify
        killproc -HUP $NSD_BIN
        #touch /var/run/nsd.pid
(bu i've tried deleting all except -HUP - same effect)

System: SLES11
# uname -a
Linux ns0.scalaxy.ru 2.6.27.23-0.1.1 x86_64(!)
# nsd -v
NSD version 3.2.2
Same problem with latest 3.2.3

Fulllog(from start to third reload):

#cat /var/lib/nsd/var/log/nsd/nsd.log
[1260179815] nsd[13450]: notice: nsd started (NSD 3.2.2), pid 13450
[1260179823] nsd[13455]: info: NSTATS 1260179823 1260179815 TYPE252=1
[1260179823] nsd[13455]: info: XSTATS 1260179823 1260179815 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=0 SFwdQ=0 SDupQ=0 SErr=0 RQ=1 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=1 SFwdR=0 SFail=0 SFErr=0 SNaAns=1 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179823] nsd[13450]: warning: server 13455 died unexpectedly with status 0, restarting
[1260179823] nsd[13454]: info: NSTATS 1260179823 1260179815 A=3 TYPE252=1
[1260179823] nsd[13454]: info: XSTATS 1260179823 1260179815 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=3 SFwdQ=0 SDupQ=0 SErr=0 RQ=4 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=1 SFwdR=0 SFail=3 SFErr=0 SNaAns=1 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179823] nsd[13450]: warning: server 13454 died unexpectedly with status 0, restarting
[1260179823] nsd[13453]: info: NSTATS 1260179823 1260179815 A=3
[1260179823] nsd[13453]: info: XSTATS 1260179823 1260179815 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=3 SFwdQ=0 SDupQ=0 SErr=0 RQ=3 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=0 SFwdR=0 SFail=3 SFErr=0 SNaAns=0 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179823] nsd[13450]: warning: server 13453 died unexpectedly with status 0, restarting
[1260179823] nsd[13452]: info: NSTATS 1260179823 1260179815 A=2 TYPE252=1
[1260179823] nsd[13452]: info: XSTATS 1260179823 1260179815 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=2 SFwdQ=0 SDupQ=0 SErr=0 RQ=3 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=1 SFwdR=0 SFail=2 SFErr=0 SNaAns=1 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179823] nsd[13450]: warning: server 13452 died unexpectedly with status 0, restarting
[1260179823] nsd[13450]: warning: xfrd process 13451 failed with status 0, restarting 
[1260179823] nsd[13450]: warning: signal received, reloading...
[1260179823] nsd[13520]: info: memory recyclebin holds 0 bytes
[1260179857] nsd[13524]: info: NSTATS 1260179857 1260179823 TYPE252=1
[1260179857] nsd[13524]: info: XSTATS 1260179857 1260179823 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=0 SFwdQ=0 SDupQ=0 SErr=0 RQ=1 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=1 SFwdR=0 SFail=0 SFErr=0 SNaAns=1 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179857] nsd[13523]: info: NSTATS 1260179857 1260179823 A=8
[1260179857] nsd[13523]: info: XSTATS 1260179857 1260179823 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=8 SFwdQ=0 SDupQ=0 SErr=0 RQ=8 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=0 SFwdR=0 SFail=8 SFErr=0 SNaAns=0 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179857] nsd[13520]: warning: server 13523 died unexpectedly with status 0, restarting
[1260179857] nsd[13520]: warning: server 13524 died unexpectedly with status 0, restarting
[1260179857] nsd[13522]: info: NSTATS 1260179857 1260179823 A=7 TYPE252=1
[1260179857] nsd[13522]: info: XSTATS 1260179857 1260179823 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=7 SFwdQ=0 SDupQ=0 SErr=0 RQ=8 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=1 SFwdR=0 SFail=7 SFErr=0 SNaAns=1 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179857] nsd[13520]: warning: server 13522 died unexpectedly with status 0, restarting
[1260179857] nsd[13521]: info: NSTATS 1260179857 1260179823 A=7 TYPE252=1
[1260179857] nsd[13521]: info: XSTATS 1260179857 1260179823 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=7 SFwdQ=0 SDupQ=0 SErr=0 RQ=8 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=1 SFwdR=0 SFail=7 SFErr=0 SNaAns=1 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179857] nsd[13520]: warning: server 13521 died unexpectedly with status 0, restarting
[1260179857] nsd[13520]: warning: signal received, reloading...
[1260179857] nsd[13588]: info: memory recyclebin holds 0 bytes
[1260179869] nsd[13592]: info: NSTATS 1260179869 1260179857 TYPE252=1
[1260179869] nsd[13592]: info: XSTATS 1260179869 1260179857 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=0 SFwdQ=0 SDupQ=0 SErr=0 RQ=1 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=1 SFwdR=0 SFail=0 SFErr=0 SNaAns=1 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179869] nsd[13591]: info: NSTATS 1260179869 1260179857 A=4 TYPE252=1
[1260179869] nsd[13591]: info: XSTATS 1260179869 1260179857 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=4 SFwdQ=0 SDupQ=0 SErr=0 RQ=5 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=1 SFwdR=0 SFail=4 SFErr=0 SNaAns=1 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179869] nsd[13590]: info: NSTATS 1260179869 1260179857 A=4
[1260179869] nsd[13590]: info: XSTATS 1260179869 1260179857 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=4 SFwdQ=0 SDupQ=0 SErr=0 RQ=4 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=0 SFwdR=0 SFail=4 SFErr=0 SNaAns=0 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179869] nsd[13589]: info: NSTATS 1260179869 1260179857 A=3 TYPE252=1
[1260179869] nsd[13589]: info: XSTATS 1260179869 1260179857 RR=0 RNXD=0 RFwdR=0 RDupR=0 RFail=0 RFErr=0 RErr=0 RAXFR=0 RLame=0 ROpts=0 SSysQ=0 SAns=3 SFwdQ=0 SDupQ=0 SErr=0 RQ=4 RIQ=0 RFwdQ=0 RDupQ=0 RTCP=1 SFwdR=0 SFail=3 SFErr=0 SNaAns=1 SNXD=0 RUQ=0 RURQ=0 RUXFR=0 RUUpd=0
[1260179869] nsd[13520]: warning: Reload already in progress (pid = 13588)

# cat /etc/nsd/nsd.conf
# DO NOT EDIT! This file is auto pulled from git for node  ns0.scalaxy.ru
#

server:
    server-count: 4
    chroot: "/var/lib/nsd"
    ip-address: "*****"
    ip-address: 127.0.0.1
    hide-version: yes
    
    identity:   "SCALAXY-AUTHORITATIVE-DNS"
    database:   "/var/lib/nsd/var/db/nsd/nsd.db"
    logfile:    "/var/lib/nsd/var/log/nsd/nsd.log"
    pidfile:    "/var/lib/nsd/var/run/nsd.pid"
    zonesdir:   "/var/lib/nsd/etc/zones"
    difffile:   "/var/lib/nsd/var/db/nsd/ixfr.db"
    xfrdfile:   "/var/lib/nsd/var/db/nsd/xfrd.state"

zone:
        name: "XXXX.ru"
        zonefile: "/var/lib/nsd/etc/zones/master/XXXX.ru"
        notify: 94.198.49.174 NOKEY
        provide-xfr: 94.198.49.174 NOKEY
Comment 1 Wouter Wijngaards 2009-12-07 11:45:06 CET
Hi,

The line [1260179823] nsd[13450]: warning: server 13455 died unexpectedly with status 0, restarting
is the first error there, and (I think) the cause here.  It should not happen, could you start nsd with -d (do not fork in background) in gdb and give a stack-trace (command 'bt' in gdb)?

Or compile with --enable-debug (assertions) and see if an assertion failure gets triggered instead of using gdb?

If the server 13455 is not actually segfaulting, but calls the exit() function, erm, some different debugging is to be attempted.

Best regards,   Wouter
Comment 2 Wouter Wijngaards 2009-12-07 12:48:51 CET
Could it be because of CHEF's signal to the wrong process?
Change the
killproc -HUP $NSD_BIN
to
/usr/sbin/nsdc reload

nsdc sends the signal to the correct process.

Best regards,   Wouter
Comment 3 Alexey Ivanov 2009-12-07 14:13:13 CET
Thanks, that helped.

So it's now SuSe init-script bug.

PS. Can you make some kind of test for -HUP being sent to wrong process?
Comment 4 Wouter Wijngaards 2009-12-07 16:13:32 CET
Good that it is found.

The SIGHUP to the wrong process, well it can be tested and ignored instead of causing this.  This will however ignore the changes that Chef is trying to reload, probably not a good solution either.

Best regards,   Wouter
Comment 5 Alexey Ivanov 2009-12-07 18:38:57 CET
Maybe some message to log saying that HUP is sent to wrong process?
Comment 6 Wouter Wijngaards 2009-12-08 09:01:21 CET
Yes that could be possible.  Should not happen during normal usage and a warning may be in order.

Please note that the nsdc reload rereads the data-db, it does not reread the config file.  For that use nsdc restart, and I guess that is what Chef wants (edit the config file have changes picked up).

Best regards,    Wouter
Comment 7 Alexey Ivanov 2009-12-08 17:01:42 CET
Fixed initscript by rplacing

killproc -HUP $NSD_BIN

with

NSD_PID=/var/lib/nsd/var/run/nsd.pid
killproc -p $NSD_PID -HUP $NSD_BIN
Comment 8 Matthijs Mekking 2009-12-22 09:14:25 CET
(In reply to comment #4)
> Good that it is found.
> 
> The SIGHUP to the wrong process, well it can be tested and ignored instead of
> causing this.  This will however ignore the changes that Chef is trying to
> reload, probably not a good solution either.
> 
> Best regards,   Wouter

Imo, we can ignore a SIGHUP sending to wrong process. If you want to reload changes, use nsdc reload, or retrieve the process from nsd.pid (as suggested below).

To avoid race conditions, we dislike logging messages when a signal is received.
Comment 9 Matthijs Mekking 2010-02-17 09:30:04 CET
Has been fixed in 3.2.4