Bug 755 - NSD spins after a zone update
NSD spins after a zone update
Product: NSD
Classification: Unclassified
Component: NSD Code
x86_64 Linux
: P5 normal
Assigned To: NSD team
Depends on:
  Show dependency treegraph
Reported: 2016-04-18 10:57 CEST by Michał Kępień
Modified: 2016-04-21 11:02 CEST (History)
1 user (show)

See Also:

Excerpts of top and pstree output (3.24 KB, text/plain)
2016-04-18 10:57 CEST, Michał Kępień
Output of GDB's "bt full" (4.78 KB, text/plain)
2016-04-18 10:58 CEST, Michał Kępień

Note You need to log in before you can comment on or make changes to this bug.
Description Michał Kępień 2016-04-18 10:57:25 CEST
Created attachment 325 [details]
Excerpts of top and pstree output

Version Information:

  * OS: Red Hat Enterprise Linux 6.7
  * NSD: nsd-4.1.7-3.el6.x86_64
  * libevent: libevent-1.4.13-4.el6.x86_64
  * kernel: kernel-2.6.32-573.22.1.el6.x86_64

First of all, I suspect the root cause for the issue described below might be a libevent bug, but perhaps there is also room for improvement on the NSD side, so I thought there was no harm in reporting what we had observed.

After three weeks of flawless performance with several dozens of zone updates every day, NSD got stuck after a seemingly innocuous zone update.  By "stuck", I mean:

  * 15 processes (judging by the PIDs, the newly-spawned ones) spinning on a write() syscall to a Unix socket, which kept returning EAGAIN over and over,
  * 1 process (judging by the PID, a to-be-replaced one) spinning in user space, inside libevent,
  * server kept answering queries, using the last properly reloaded version of the zone,
  * server kept receiving notifies for further zone updates, but none of them resulted in replacing the zone with a more recent version.

I have attached output excerpts from top and pstree.  An excerpt of strace attached to one of the spinning processes follows:

    # strace -p 5035 2>&1 | head
    Process 5035 attached
    write(50, "\6\0\0\0", 4)                = -1 EAGAIN (Resource temporarily unavailable)
    write(50, "\6\0\0\0", 4)                = -1 EAGAIN (Resource temporarily unavailable)
    write(50, "\6\0\0\0", 4)                = -1 EAGAIN (Resource temporarily unavailable)
    write(50, "\6\0\0\0", 4)                = -1 EAGAIN (Resource temporarily unavailable)
    write(50, "\6\0\0\0", 4)                = -1 EAGAIN (Resource temporarily unavailable)
    write(50, "\6\0\0\0", 4)                = -1 EAGAIN (Resource temporarily unavailable)
    write(50, "\6\0\0\0", 4)                = -1 EAGAIN (Resource temporarily unavailable)
    write(50, "\6\0\0\0", 4)                = -1 EAGAIN (Resource temporarily unavailable)
    write(50, "\6\0\0\0", 4)                = -1 EAGAIN (Resource temporarily unavailable)

File descriptor 50 was a Unix socket:

    # ls -la /proc/5035/fd/50
    lrwx------. 1 root root 64 04-18 08:12 50 -> socket:[905738346]
    # netstat -a -n -p | grep 905738346
    unix  3      [ ]         STREAM     CONNECTED     905738346 5035/nsd

Meanwhile, process 4984 was stuck in userspace (full GDB backtrace will be attached shortly).

Nothing interesting appeared in the logs with verbosity set to 2.

I had to restart NSD shortly after collecting the diagnostic information posted above.  In case anything else code-related is required to diagnose this issue, I would be grateful if you advised me what it is, so that I am able to collect it when this issue reoccurs.  So far, it has only been observed once.
Comment 1 Michał Kępień 2016-04-18 10:58:41 CEST
Created attachment 326 [details]
Output of GDB's "bt full"
Comment 2 Wouter Wijngaards 2016-04-18 11:03:58 CEST
Hi Michal,

Thank you for the data.  Collecting more data could be done with a compile that has --enable-checking enabled (the -F and -L commandline options are available, you can see with nsd -h).  Then starting NSD with  -F 60 -L 2   -- this enabled tracing of the interprocess communication that happens over that pipe.  This is the pipe the processes are waiting for and what the busywaiting other process is doing seems to be what could be the issue.

I'll go over the data you posted now.

Best regards, Wouter
Comment 3 Michał Kępień 2016-04-18 15:42:16 CEST
Hi Wouter,

Thanks for a quick response.

If you really meant -F 60, I'm afraid the amount of logs it generates makes it impossible for me to enable it in the long run on the server in question.  However, -F 40 seems to be enough to enable IPC debugging and its overhead is negligible, is that enough or do you really need xfrd logging as well?
Comment 4 Wouter Wijngaards 2016-04-18 15:44:52 CEST
Hi Michal,

The -F 40 is likely enough.  -F 60 is because this started as a look into xfrd because nsd was not servicing the notifies.  But the IPC part seems most interesting.  Is this bug getting serviced as a libevent bug?  (does it look like it is one? in which case, that'll probably explain it all; and all I can do is think about robustness code).

Best regards, Wouter
Comment 5 Michał Kępień 2016-04-19 12:44:25 CEST
Hi Wouter,

I have not yet consulted this issue with libevent developers, though I found two threads discussing issues exhibiting similar symptoms, i.e. an infinite loop inside timeout_process():


The first case was inconclusive (Nick Mathewson wrote: "if EVLIST_TIMEOUT is not set on the event's ev_flags field, the event shouldn't be in the timeout heap in the first place").  The other report's author explicitly admitted that their code was at fault, not libevent's.

I cannot confirm that erroneous event flags were indeed the culprit in my case.  However, I tried to simulate such a scenario by attaching a debugger to a running NSD process, setting a breakpoint in timeout_process() and opening a TCP connection to NSD.  Once the event became active (after tcp-timeout seconds) and the breakpoint fired, I cleared the EV_TIMEOUT bit in the ev_flags field of the timeout event and then detached the debugger.  What followed closely resembled the symptoms I originally reported: one process kept spinning in userspace, the server kept answering queries, but zone updates were not happening any more.

Hopefully, the above experiment will be helpful in pinpointing the root cause.  Meanwhile, I can run NSD with -F 40 -L 2 and wait for the issue to reoccur.  Once that happens, I'll dig a bit deeper in the running process' libevent structures; last time I only noticed this could be related to libevent after restarting NSD.
Comment 6 Wouter Wijngaards 2016-04-19 15:10:59 CEST
Hi Michal,

Thank you very much for the debugging.  I searched though Gbs of logs and your traces and then I found in the code this fix:

Index: server.c
--- server.c	(revision 4620)
+++ server.c	(working copy)
@@ -2422,7 +2422,10 @@
 	if (slowaccept || data->nsd->current_tcp_count == data->nsd->maximum_tcp_count) {
-		slowaccept = 0;
+		if(slowaccept) {
+			event_del(&slowaccept_event);
+			slowaccept = 0;
+		}
 	assert(data->nsd->current_tcp_count >= 0);

The event_del was missing here, and in a case where the second part of the containing if triggers it could then set the slowaccept to 0 without the event-del.  Which later on could cause an event_set on an already set event (but only without its timeout of a second) and then cause the infinite loop in libevent.

Does the patch resolve your issue?  The code is also available from the code repository.

Best regards, Wouter
Comment 7 Michał Kępień 2016-04-21 11:02:21 CEST
Hi Wouter,

The patch works indeed, thank you very much for providing it so swiftly.

For posterity's sake, I thought I would elaborate on this bug a bit more as I believe it is caused neither by the second part of the containing "if" nor by calling event_set() on an already set event.

Below you'll find the scenario I believe happened on our server.  An important prerequisite is that the maximum number of open file descriptors (kernel default or taken from limits.conf or set using "ulimit -n") has to be smaller than tcp-count in nsd.conf.

 1. Incoming TCP connections fill up the available pool of open file descriptors for a process (PID 4984 in this case).  slowaccept is set to 1, a two-second slowaccept timeout is set.  Further TCP connections are queued by the OS, but not accepted by NSD.

 2. Two seconds pass without any file descriptor getting closed.  The slowaccept timeout fires, changing slowaccept to 0 (handle_slowaccept_timeout), but it almost immediately gets changed back to 1 (handle_tcp_accept) as there is still no "space" in the open file descriptor pool.

 3. Step 2 is repeated an arbitrary number of times. (This is harmless so far.) Essentially, while there are further TCP connections queued, there is always a slowaccept timeout event pending in libevent's timeout queue.

 4. One of the TCP connections gets closed, freeing up a file descriptor.  slowaccept is reset to 0 [1], but the timeout event isn't removed from libevent's timeout queue (this is the root cause, as you already figured out).

 5. One of the queued TCP connections get accepted, but due to further TCP connections being queued, accept() returns -1 for the next one, causing another copy of the slowaccept timeout event (pointing to the same memory) to be inserted into libevent's timeout queue. [2]

 6. The two-second timeout of the slowaccept event is triggered.  timeout_process() is called, which calls event_del().  As the EVLIST_TIMEOUT bit is set for the slowaccept event, event_queue_remove() is called, which clears the EVLIST_TIMEOUT bit for _all_ copies of the slowaccept event while only removing _one_ copy from libevent's timeout queue.

 7. Another iteration of the loop inside timeout_process() encounters a copy of the slowaccept event.  As all copies of that event share a common ev_timeout value, libevent determines that this copy should also be marked as active.  event_del() is called for this copy, but as the latter does not have the EVLIST_TIMEOUT bit set, it is _not_ removed from libevent's timeout queue.

The last step is then infinitely repeated.

At the time I gathered the diagnostic data, the open file descriptor count for process 4984 was equal to the soft limit for that value.  I don't think zone transfers had anything to do with this issue, they were just impacted due to a child process spinning.

If tcp-count in nsd.conf is lower than the maximum number of open file descriptors, everything works fine because when current_tcp_count >= maximum_tcp_count, handle_tcp_accept() returns without even attempting to call accept() and thus the slowaccept timeout event is never created.

[1] Note that this is done due to the first part of the containing "if" in cleanup_tcp_handler() being true, not the second part.

[2] Note that steps 4-5 can happen multiple times if multiple TCP connections are closed and there are still further TCP connections queued.  However, all it takes to trigger this bug is to cause two copies of the slowaccept timeout event to be simultaneously present in libevent's timeout queue.