Bug 357 - NSDC restart and stop commands both fail to kill all NSD processes ---> new process can not start
Status: RESOLVED FIXED
Product: NSD
Classification: Unclassified
Component: NSD Code
Version: 3.2.x
Hardware: i386 Linux
Importance: P5 major
Assigned To: NSD team
Reported: 2011-02-10 16:44 CET by Olafur Gudmundsson
Modified: 2013-06-27 16:25 CEST

Description Olafur Gudmundsson 2011-02-10 16:44:41 CET
I have run into the issue that the restart command, or stop followed by start, sent to nsdc does not kill the running processes.
I finally got around to exposing this with a repeatable test.
Here is what I see:
   nsdc <conffile> start 
   nsd.pid file exists 
   ps axu | grep nsd | grep user    --> 3 nsd processes running 

   nsdc <conffile> stop 
   nsd.pid file goes away 
   ps axu | grep nsd | grep user ---> 1 nsd process remains 

The remaining process is still bound to the IP addresses, so a newly
started one fails to start.
Waiting for the process to die does not seem to work; I gave up waiting
after 1 hour.
A manual kill followed by start seems to work as a workaround.
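
The manual-kill workaround can be scripted roughly as follows. This is only a sketch: force_stop is a hypothetical helper, not part of nsdc, and the 10 second default timeout is an arbitrary assumption. It sends TERM, polls until the process is gone, and falls back to KILL.

```shell
#!/bin/sh
# Hypothetical helper (not part of nsdc): stop one leftover process.
# Usage: force_stop PID [TIMEOUT_SECONDS]
force_stop() {
    pid=$1
    timeout=${2:-10}    # assumed default, not taken from the report

    # If TERM cannot be delivered, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    waited=0
    # kill -0 only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            # The process ignored TERM for too long: force it.
            kill -KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

After "nsdc <conffile> stop", the helper would be called once for each PID still listed by the ps command above, and only then would "nsdc <conffile> start" be run.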

This behaviour has been observed on FreeBSD-7.2 and 8.1 both 32 and 64 bit versions. I also see this behaviour on 64bit Fedora-13 and 64bit Ubuntu-10.10   
Test set-up to repeat this behaviour:
configure NSD as secondary for <= 1000 zones 
   in this case restart works most of the time 

configure NSD as secondary for >= 2000 zones 
   in this case restart/stop fails all the time 

(actually the bug seems to appear somewhere between 1250 and 1500 zones)

     Olafur
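
A test configuration with many secondary zones can be generated with a short script. The sketch below assumes NSD 3 style zone stanzas; the zone names, zonefile names and master address are placeholders and not the values used in the original test.

```shell
#!/bin/sh
# Print COUNT secondary-zone stanzas for an nsd.conf file.
# Zone names (testN.example), zonefile names and the master address
# are placeholders chosen for illustration only.
# Usage: gen_secondary_zones COUNT MASTER_IP
gen_secondary_zones() {
    count=$1
    master=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'zone:\n'
        printf '  name: "test%d.example"\n' "$i"
        printf '  zonefile: "test%d.example.zone"\n' "$i"
        # NOKEY means the transfer is not protected by a TSIG key.
        printf '  allow-notify: %s NOKEY\n' "$master"
        printf '  request-xfr: %s NOKEY\n' "$master"
        i=$((i + 1))
    done
}
```

For example, "gen_secondary_zones 2000 192.0.2.1 >> nsd.conf" would append 2000 stanzas, which is in the range where the report says stop and restart always fail.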
Comment 1 Wouter Wijngaards 2011-02-10 16:56:08 CET
Hi Olafur,

This looks like a race condition bug that was fixed in NSD 3.2.7.  Are you using an older version?

Best regards, Wouter
Comment 2 Olafur Gudmundsson 2011-02-10 23:08:39 CET
I replied to Wouter's question via e-mail but my response has not shown up here in over 6 hours :-(

We used both nsd-3.2.6 and nsd-3.2.7; the behavior is the same.


 Olafur
Comment 3 Wouter Wijngaards 2011-02-11 14:28:53 CET
Sorry about the email, you can email me directly if you prefer.  I have tried to recreate your situation, but my test with 10,000 secondary zones works well (it takes a couple of seconds before it is fully stopped).  Can you send me (by email) more details or your test setup?  Would you like a copy of mine?

Best regards,   Wouter
Comment 4 Wouter Wijngaards 2011-02-23 18:58:00 CET
There is a fix in svn trunk r3167 which makes sure all processes exit when there are many zones.
Comment 5 Wouter Wijngaards 2011-03-02 16:28:17 CET
Closing the bug; just reopen it if your test indicates further trouble.  I am also interested in how the patch I sent you via email behaves.

Best regards,   Wouter
Comment 6 Simon Arlott 2012-08-31 23:35:08 CEST
With version 3.2.9 and a slightly modified nsdc to report what's going on:
controlled_stop() {
        pid=$1
        try=1

        while [ $try -ne 0 ]; do
                if [ ${try} -gt 50 ]; then
                        echo "nsdc stop failed"
                        return 1
                else
                        ps aux|grep nsd         # debug: processes before the signal
                        echo killing $pid       # debug
                        if [ $try -eq 1 ]; then
                                kill -TERM ${pid}
                        else
                                kill -TERM ${pid} >/dev/null 2>&1
                        fi

                        ps aux|grep nsd         # debug: processes after the signal
                        # really stopped?
                        kill -0 ${pid} #>/dev/null 2>&1
                        if [ $? -eq 0 ]; then
                                controlled_sleep ${try}
                                try=`expr ${try} + 1`
                        else
                                try=0
                        fi
                        ps aux|grep nsd         # debug: processes after the check
                fi
        done

        return 0
}

When I restart nsdc, one of the processes is still running after the main process has exited.

root@blackhole:~# nsdc restart
nsd      11298  /usr/sbin/nsd -c /etc/nsd3/nsd.conf
nsd      11299  /usr/sbin/nsd -c /etc/nsd3/nsd.conf
nsd      11300  /usr/sbin/nsd -c /etc/nsd3/nsd.conf
nsd      11301  /usr/sbin/nsd -c /etc/nsd3/nsd.conf
nsd      11302  /usr/sbin/nsd -c /etc/nsd3/nsd.conf
nsd      11303  /usr/sbin/nsd -c /etc/nsd3/nsd.conf
killing 11298
nsd      11299  /usr/sbin/nsd -c /etc/nsd3/nsd.conf
/usr/sbin/nsdc: line 214: kill: (11298) - No such process
nsd      11299  /usr/sbin/nsd -c /etc/nsd3/nsd.conf

This frequently causes the new nsd to fail to bind its sockets and exit.

My nsd.conf contains:
server:
  server-count: 4

There are only 35 zones, but this happens on 5 servers running different versions of Linux.

send_children_quit() in server.c should wait for the child to exit before returning.
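
The idea of waiting for each child before returning can be illustrated in shell. This is only an analogy under stated assumptions: the real send_children_quit() is C code in server.c and is not shown in this report, and quit_children_and_wait is a hypothetical name. The point is that the function returns only after every child has been reaped, so nothing can still hold the sockets afterwards.

```shell
#!/bin/sh
# Shell analogy only; NSD's send_children_quit() is C code in server.c.
# Signal every child first, then block until each one has exited.
# Usage: quit_children_and_wait PID...
# The PIDs must be children of the calling shell.
quit_children_and_wait() {
    for child in "$@"; do
        kill -TERM "$child" 2>/dev/null
    done
    for child in "$@"; do
        # wait returns once the child has exited and been reaped.
        wait "$child" 2>/dev/null
    done
    return 0
}
```

Returning right after the first loop would match the behaviour seen in the output above, where PID 11299 is still listed after the main process 11298 has gone.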
Comment 7 Matthijs Mekking 2013-06-27 16:25:35 CEST
I have committed your proposed fix in the NSD 3.2 branch (r3981). Apparently there was already a similar fix for this in trunk (NSD 4). NSD 3 will soon see a new patch release.