Bug 1272

Summary: NSD processes reforks
Product: NSD
Component: NSD Code
Status: ASSIGNED
Severity: normal
Priority: P5
Version: 4.1.x
Hardware: x86_64
OS: FreeBSD
Reporter: Igor <admin>
Assignee: NSD team <nsd-team>
CC: wouter

Description Igor 2017-05-30 13:56:33 CEST
Hello!

I run NSD 4.1.15 as a slave DNS service under FreeBSD 10.3.
My concern is significant delays in some responses during reforks of child NSD processes. These delays are 200-400 milliseconds (roughly 1-2 "long" delays per hour).

I found that the reforks are triggered by incoming NOTIFY messages from the masters. But I would like to know: is it normal operation for NSD to show such "long" delays when it reforks its children?

My system has 6 CPUs; I give 4 of them (or, right now, all 6) to the server-count option in the NSD config. It has 40 GB of RAM, more than enough. NSD was installed as a precompiled version from FreeBSD packages. As a dirty hack I compiled my own NSD 4.1.15 with the experimental mmap memory support (instead of malloc), and with it the "long" delays during reforks became much shorter. So what is the proper way to tune NSD to minimize these delays?

Thank you in advance for your help!
Comment 1 Wouter Wijngaards 2017-05-30 14:39:42 CEST
Hi Igor,

NSD should not have long timeouts during reforks.

If you run NSD with database: ""  (so no nsd.db is memory mapped), perhaps forking is faster?  That is a good setting to use anyway: nsd.db uses up a lot of memory but is not really needed.  With database: "", NSD writes to the zone files once in a while instead of keeping nsd.db memory mapped.

Are the masters sending too many notify messages?  You can restrict this per IP address.  When the masters send notify messages this uses the unix pipe between processes, and that pipe is also used during the refork process.
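For example, in nsd.conf (a sketch; "example.com" and the master address 192.0.2.1 are placeholders for your own values):

server:
	database: ""                    # no nsd.db; NSD writes zonefiles instead
	server-count: 4

zone:
	name: "example.com"
	zonefile: "example.com.zone"
	allow-notify: 192.0.2.1 NOKEY   # accept NOTIFY only from this master
	request-xfr: 192.0.2.1 NOKEY    # pull transfers from the same master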

Best regards, Wouter
Comment 2 Igor 2017-05-30 15:45:48 CEST
(In reply to Wouter Wijngaards from comment #1)

Thank you for your quick reply, Wouter!

Yes, I run NSD with the database: "" option, so there is no big nsd.db with precompiled responses. So is there any reasonable benefit to mmap in this "dbless" version of the service, or should I revert to standard malloc?

As for the notify messages, they come from only two masters (i.e., from two IPs). And my general question is about the unix pipes NSD uses as its IPC mechanism: does the duration of a refork depend on zone size? Say, a zone of about 500 megabytes, or 5 million RRs with NSEC3 support? As far as I could "dig" into the NSD source code, there might be a zone rehash on every child refork; am I right?

When a long delay comes up, the process states in top look like this:
last pid: 28106;  load averages:  0.90,  0.94,  0.89  up 1+02:47:12 16:47:16
37 processes:  2 running, 31 sleeping, 4 lock
Mem: 4126M Active, 7149M Inact, 2495M Wired, 22M Cache, 3310M Buf, 33G Free
Swap: 4096M Total, 4096M Free
PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME WCPU COMMAND
28097 bind          1  25    0     0K    16K CPU0    0   0:01 5.76% nsd
28288 bind          1  52    0  4808M  4760M select  5   5:59 4.59% nsd
28287 bind          1  20    0 48760K 27100K kqread  3   1:26 0.39% nsd
28101 bind          1  26    0  4808M  4760M *vm ob  5   0:00 0.00% nsd
28102 bind          1  26    0  4808M  4761M *vm ob  4   0:00 0.00% nsd
28104 bind          1  28    0  4808M  4761M *vm ob  5   0:00 0.00% nsd
28103 bind          1  27    0  4808M  4761M *vm ob  0   0:00 0.00% nsd

Maybe this somehow helps you.
Comment 3 Wouter Wijngaards 2017-05-30 15:57:45 CEST
Hi Igor,

With the mmap option NSD uses mmap instead of malloc for memory allocation (with its special memory allocator).  It may be a bit faster (for allocation), but I did not think it would make (a lot of) difference for forks.  But for you it does, so something is going on, though I don't know what.  Are you running something like memory compression?
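(For illustration, "mmap instead of malloc" means grabbing page-backed chunks directly from the kernel and carving allocations out of them; a minimal sketch of the idea, not NSD's actual allocator code:)

#include <stddef.h>
#include <sys/mman.h>

/* Grab a zeroed, page-backed chunk directly from the kernel,
 * bypassing malloc; a region allocator then carves its small
 * allocations out of chunks like this. */
static void *chunk_from_kernel(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,
	               MAP_PRIVATE|MAP_ANON, -1, 0);
	return (p == MAP_FAILED) ? NULL : p;
}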

The fork should take the same time regardless of the number of zones and the size of those zones.  During the fork it opens and closes a couple of file descriptors (pipes), and writes a couple of bytes on them.

If there are notifies during the fork, it will wait until the notify has been processed (one packet) before it moves forward.  In server.c there is a #define that could perhaps reduce the issue: #  define NUM_RECV_PER_SELECT 100
  If you lower that to, say, 10, does that make the issue go away?
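That is, a one-line change along these lines in server.c (10 is only an example value):

-#  define NUM_RECV_PER_SELECT 100
+#  define NUM_RECV_PER_SELECT 10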

From your top output, it looks like the new process has 0 memory usage.  Is something up with swapping?  Or perhaps the kernel is copying memory instead of forking with copy-on-write?  Maybe other copy-on-write-type settings in the kernel?  It does not seem to be waiting for the old process (notify processing), but somehow creating a new process takes time.

There was a similar issue a while ago where, due to the pattern of forks and long uptime, Linux would create an ever-growing linked list of memory maps (every fork added an entry, but that entry did not show up as memory in top).  To fix that I changed the fork ordering; perhaps FreeBSD has a similar issue?  The very long list caused the kernel to spend a lot of time processing the vm map table.

But I also see no swap used and 33G free.  Is this something about the VM layer (I see the '*vm ob' lock state)?  NSD attempts a copy-on-write fork, where the child shares the same memory pages as the parent; that should be really fast, and would indeed show up as 0 memory.

Best regards, Wouter
Comment 4 Igor 2017-05-30 17:07:52 CEST
(In reply to Wouter Wijngaards from comment #3)

Thanks for the detailed explanation, Wouter!

I will investigate the operation of my service along the lines of your remarks and get back to you later.
Comment 5 Igor 2017-06-05 12:15:41 CEST
Hello, Wouter!

I have researched the possibility of notifies arriving during NSD reforks, and dumped the zone transfers over TCP. In the captured traffic I saw some things that were new to me, for example when the slave NSD initiates a TCP connection to the master BIND:

12:04:37.324508 IP 192.168.255.250.45960 > 192.168.255.250.53: Flags [S], seq 2193962058, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 1546614523 ecr 0], length 0
12:04:37.324532 IP 192.168.255.250.53 > 192.168.255.250.45960: Flags [S.], seq 72089075, ack 2193962059, win 65535, options [mss 16344,nop,wscale 6,sackOK,TS val 3536168760 ecr 1546614523], length 0
12:04:37.324544 IP 192.168.255.250.45960 > 192.168.255.250.53: Flags [.], ack 1, win 1275, options [nop,nop,TS val 1546614523 ecr 3536168760], length 0
12:04:37.324643 IP 192.168.255.250.45960 > 192.168.255.250.53: Flags [P.], seq 1:3, ack 1, win 1275, options [nop,nop,TS val 1546614523 ecr 3536168760], length 2
12:04:37.430332 IP 192.168.255.250.53 > 192.168.255.250.45960: Flags [.], ack 3, win 1275, options [nop,nop,TS val 3536168867 ecr 1546614523], length 0
12:04:37.430409 IP 192.168.255.250.45960 > 192.168.255.250.53: Flags [P.], seq 3:104, ack 1, win 1275, options [nop,nop,TS val 1546614630 ecr 3536168867], length 101 [b2&3=0x1] [1a] [0q] [1394au] (99)

Here we see the first PUSH+ACK packet from NSD carrying 2 bytes (at 12:04:37.324643), then the ACK from BIND (at 12:04:37.430332), i.e. BIND acknowledged only after about 100 milliseconds, and only then the second PUSH+ACK from NSD with the 99-byte message (at 12:04:37.430409). This looks like the classic interaction of Nagle's algorithm with delayed ACKs: the 2-octet length prefix goes out in its own segment, and the rest of the message is held back until that segment is acknowledged.

RFC 7766 says the following:

   DNS clients and servers SHOULD pass the two-octet length field, and
   the message described by that length field, to the TCP layer at the
   same time (e.g., in a single "write" system call) to make it more
   likely that all the data will be transmitted in a single TCP segment.

So the service SHOULD send the message in a single TCP segment (as can indeed be seen in the "slave BIND <---> master BIND" case), but this is not a MUST.

Therefore I decided to look for a solution on my own. The solution is a simple patch I wrote against the original xfrd-tcp.c:
--- /root/install/nsd-4.1.15/xfrd-tcp.c 2017-01-19 18:32:27.000000000 +0300
+++ /root/install/nsd-4.1.15_patched/xfrd-tcp.c 2017-06-05 11:52:27.966434000 +0300
@@ -538,6 +538,7 @@
        }

        if(xfrd->nsd->outgoing_tcp_mss > 0) {
+
 #if defined(IPPROTO_TCP) && defined(TCP_MAXSEG)
                if(setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG,
                        (void*)&xfrd->nsd->outgoing_tcp_mss,
@@ -640,6 +641,18 @@
 {
        ssize_t sent;

+#if defined(RFC_7766)
+       /* according to RFC7766 we SHOULD send DNS packets
+        * in a single TCP-segment during a transaction
+        * over TCP
+        */
+       int i = 1;
+       if(setsockopt(tcp->fd, IPPROTO_TCP, TCP_NOPUSH, &i, sizeof(i)) < 0) {
+               log_msg(LOG_ERR, "xfrd: setsockopt(TCP_NOPUSH) "
+                               "failed: %s", strerror(errno));
+       }
+#endif
+
        if(tcp->total_bytes < sizeof(tcp->msglen)) {
                uint16_t sendlen = htons(tcp->msglen);
                sent = write(tcp->fd,
@@ -668,6 +681,16 @@
        sent = write(tcp->fd,
                buffer_current(tcp->packet),
                buffer_remaining(tcp->packet));
+
+#if defined(RFC_7766)
+       i = 0;
+       if(setsockopt(tcp->fd, IPPROTO_TCP, TCP_NOPUSH, &i, sizeof(i)) < 0) {
+               log_msg(LOG_ERR, "xfrd: setsockopt(TCP_NOPUSH) "
+                               "failed: %s", strerror(errno));
+       }
+       /* OK, end of RFC7766 "constraints" */
+#endif
+
        if(sent == -1) {
                if(errno == EAGAIN || errno == EINTR) {
                        /* write would block, try later */


so that NSD can send the message in a single TCP segment and thus avoid at least these 100-millisecond delays. The patch works successfully.

So I would like to hear your recommendation: can I use this patch, from the point of view of service security, stability, and reliability?
Comment 6 Wouter Wijngaards 2017-06-06 14:27:31 CEST
Hi Igor,

Thank you for the patch!  I implemented this identically to how queries already do it, with a call to writev(2).  That should have a similar effect but avoids the socket option calls.  My patch is below (a bit more code).  A good fix to have; thanks for the report!

Best regards, Wouter

Index: xfrd-tcp.c
===================================================================
--- xfrd-tcp.c	(revision 4752)
+++ xfrd-tcp.c	(working copy)
@@ -642,9 +642,18 @@
 
 	if(tcp->total_bytes < sizeof(tcp->msglen)) {
 		uint16_t sendlen = htons(tcp->msglen);
+#ifdef HAVE_WRITEV
+		struct iovec iov[2];
+		iov[0].iov_base = (uint8_t*)&sendlen + tcp->total_bytes;
+		iov[0].iov_len = sizeof(sendlen) - tcp->total_bytes;
+		iov[1].iov_base = buffer_begin(tcp->packet);
+		iov[1].iov_len = buffer_limit(tcp->packet);
+		sent = writev(tcp->fd, iov, 2);
+#else /* HAVE_WRITEV */
 		sent = write(tcp->fd,
 			(const char*)&sendlen + tcp->total_bytes,
 			sizeof(tcp->msglen) - tcp->total_bytes);
+#endif /* HAVE_WRITEV */
 
 		if(sent == -1) {
 			if(errno == EAGAIN || errno == EINTR) {
@@ -656,11 +665,19 @@
 		}
 
 		tcp->total_bytes += sent;
+		if(sent > (ssize_t)sizeof(tcp->msglen))
+			buffer_skip(tcp->packet, sent-sizeof(tcp->msglen));
 		if(tcp->total_bytes < sizeof(tcp->msglen)) {
 			/* incomplete write, resume later */
 			return 0;
 		}
-		assert(tcp->total_bytes == sizeof(tcp->msglen));
+#ifdef HAVE_WRITEV
+		if(tcp->total_bytes == tcp->msglen + sizeof(tcp->msglen)) {
+			/* packet done */
+			return 1;
+		}
+#endif
+		assert(tcp->total_bytes >= sizeof(tcp->msglen));
 	}
 
 	assert(tcp->total_bytes < tcp->msglen + sizeof(tcp->msglen));
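
The essence of the writev(2) approach as a standalone sketch (illustrative only, not the NSD code; partial writes are left for the caller to resume):

#include <sys/types.h>
#include <sys/uio.h>     /* writev, struct iovec */
#include <arpa/inet.h>   /* htons */
#include <stdint.h>

/* Hand the 2-octet length prefix and the DNS message to the kernel in
 * a single system call, so both normally leave in one TCP segment and
 * the Nagle/delayed-ACK stall cannot occur between them. */
static ssize_t send_dns_tcp(int fd, const uint8_t *msg, uint16_t msglen)
{
	uint16_t prefix = htons(msglen);
	struct iovec iov[2];
	iov[0].iov_base = &prefix;
	iov[0].iov_len = sizeof(prefix);
	iov[1].iov_base = (void*)msg;
	iov[1].iov_len = msglen;
	return writev(fd, iov, 2); /* may write fewer bytes; resume later */
}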
Comment 7 Igor 2017-06-07 10:33:36 CEST
Hello!

Many thanks for your version of the patch, Wouter! It looks more general and useful, regardless of which operating system we apply the patch to NSD on. In my version I had planned to insert multiple #ifdef's checking for the existence of the TCP_NOPUSH (FreeBSD) and TCP_CORK (Linux) socket options, plus some tricky code for Solaris 10 ;) So, if you don't mind, I will take your version of the patch and apply it to all of my production NSD services.