NSD4 TCP Performance

July 8, 2013 by wouter

For NSD 4 the TCP performance was optimised, with different socket handling compared to NSD 3. This article discusses a TCP performance test for NSD 4. In previous blog contributions, general (UDP) performance was measured and memory usage was analysed for NSD 4.

The TCP performance was measured by taking the average qps reported by the dnstcpbench tool from the PowerDNS source distribution. (Thanks for a great tool!) The timeout was set to 100 msec. On FreeBSD the system sends connection resets when a TCP connection cannot be established, and in this situation the tool overreports the qps. To mitigate this the qps was scaled back by multiplying by the fraction of succeeded tcp queries. The scaled back qps is close to the median qps that is also reported by the tool. For Linux, such scaling was not performed, and the average and median are close together.

You can click to enlarge these charts:


The highest TCP queries per second performance on Linux is about 14k qps by Yadifa, and then followed by NSD 4. On FreeBSD performance is higher, about 16k qps, and NSD 4 is fastest, with Knot and then Yadifa following with about 14k qps. Notice how Bind performs at 12k qps on FreeBSD, and 8k qps on Linux. PowerDNS remains at about the same speed. NSD 4 has higher TCP qps than NSD 3, on Linux and on FreeBSD.

On FreeBSD, software that cannot handle the load produces connection errors. The number of connection errors goes down when more threads are used by NSD 3, and Knot. For other software the thread count does not really influence this connection error count, but it does increase the qps performance. For Yadifa the qps performance degrades substantially when more threads are used, and it has a large number of connection errors because of that.  In general the connection errors are caused by a lack of performance, and a TCP connection cannot be established. Linux apparently deals differently with this (turns it into timeouts), this may cause some qps reporting differences between the OSes. In both cases the charts represent the average successful TCP qps.

The same pattern as for UDP can be seen with the number of threads, for NSD 4, the best Linux performance uses 2 cpu, and performance increases better and higher on FreeBSD, but the optimum here is 3 cpu instead of 4 cpu for UDP. Other software similarly benefits from more CPU power. It turns out that the PowerDNS option to get extra distribution threads adds UDP workers and not TCP workers, this is why performance does not scale up in these charts for PowerDNS. Yadifa performance goes down for both Linux and FreeBSD when more threads are in use.

The zone served in these experiments is a synthetic root zone (as used in previous tests), with 1 million random queries for unsigned delegations. PowerDNS uses its zonefile backend. The same test system(s) as used in the previous measurements are used, a PowerEdge 1950 with 4 cores at 2 GHz is running the DNS server.

NSD4 High Memory Usage

July 5, 2013 by wouter

NSD 4 is currently in beta and we are expecting a release candidate soon. This is the second of a series of blog-posts in which we describe some findings that may help you to optimize your NSD4 installation. In the first article we talked about general performance, this article muses about memory usage. (This article is based on the forthcoming nsd-4.0.0b5)

NSD4 Memory usage

The memory intensive architectural trade-off between pre-compiling answers and high speed serving of packets has been part of the NSD design since its first incarnation almost a decade ago.

With NSD 4 we continued the pre-compilation philosophy.  It even seems that, compared to NSD 3, NSD 4 uses more memory. Why? How?


Memory is being consumed to achieve speed improvements, but also for improvements such that administrators can update the zones served without needing the restart that so prominently featured NSD 3; NSD 4 can update, add, and remove zones without a restart. NSD 4 can receive IXFR (incremental zone transfers) and apply them in a time that depends on the size of that transfer, independent of zone size. Besides, during an update the database as stored on disk (nsd.db) is updated. On incremental updates of NSEC3 signed zones the nsec3-precompiled answers are all updated as well. All these features, that improve usability and speed imply that disk usage and memory usage increased compared to NSD 3.


To compare the memory a Dell PowerEdge 1950 with 8 GB of RAM, a large HDD and 2 GHz Xeon CPU (the same machine used for the performance tests earlier) was used to load the .NL zone (the authoritative Dutch top-level-domain) from June 2013. This is a fairly large zone, its zonefile is about 1.5 GB. It has 5.3 million delegations. It is signed with DNSSEC, and uses NSEC3 (opt-out), and has about 28% signed delegations. This means, with the nsec3 domains for the signed delegations, it has 5.3 * 1.28 = 6.8 million domain names with associated resource records.

The figure below shows the memory use of the daemon. ‘Rss’ represents the resident memory used by the daemon after starting. ‘Rss other’ is measured by tracking the total system memory usage. ‘Compiler’ represent the memory used by a zone compiler (if the software has such) and added onto it. This assumes you run the zone compiler and the DNS server on the same machine. If swap space is used we add it separately. Finally, the virtual memory usage (‘vsz extra’) is also added onto the bar, that entry reflects the size of the memory-mapped I/O to the nsd.db for NSD 4. Note that the memory-mapped I/O does not need to reside in core-memory.


We configured our measurement machine with 8 GB of RAM and we observe that the NL zone barely fits with NSD 4 (16 GB would be a better and realistic configuration). Bind and Yadifa can easily serve the zone from 8 GB of core memory. The zone compiler of Knot runs into swap space because it becomes very big. NSD 4 causes swap space to be used for a different reason, it (barely) fits in the 8 GB (about 7 GB), but its heavy use of memory mapped I/O causes the Linux kernel to make space in RAM by swapping other stuff to disk. The 8 GB of RAM is insufficient, you can start the daemon, but provisioning for operations it is too tight for common tasks, such as reloading the zone from zonefile and processing a (large) AXFR. However, because of it’s new design, NSD 4 could actually work in this RAM if it handled only relatively small IXFR updates.

The NSD 4 usage is the main daemon, plus a very small xfrd (xfrd now uses less than it did in NSD 3). The main daemon uses more memory for an increase in speed and also for better NSEC3 zone update processing. The virtual memory is the memory mapped nsd.db file. The kernel uses its virtual memory cache mechanism to handle this I/O, and you can provision for less than the total nsd.db file (at the cost of update processing speed). Realistic provisioning for NSD 4 here is about 10% of the virtual space to 100% of the virtual space, somewhere between 9 GB – 17 GB. It would be wise to add another multiple of memory on this for large zone changes, which because NSD keeps serving the old zone while it is busy setting up the new version of the zone, uses about twice the memory for that zone, so add another 6-7 Gb (the rss) for this (AXFR, zonefile change).

The NSD 3 usage is the base daemon, plus a xfrd (the other process), plus zonec. For continued operations another (same sized) chunk should be added for nsdc update, that updates the zonefiles and cleans up the nsd.db. This causes NSD 3 to use more memory in its provision than is necessary to run NSD 4 with a low disk I/O provision. This is because NSD 4 does not have zonec and nsdc update, this has been folded into the main daemon, and is performed during reload tasks (the daemon keeps serving DNS), and is what causes the disk structures to be much larger.

NSD 4 comes with a tool that tells you estimates for the size of RAM and disk needed for a zone. For this zone it indicates that 6.9 GB is used for RAM and 11 GB is used for nsd.db. The tool estimates that about 8 GB to about 17 GB could be used to run the NL zone (with 10% – 100% of the nsd.db memory mapped). As an aside, you build the nsd-mem tool by ‘make nsd-mem’ in the source repository.

NSEC3, memory and performance

NSD4 uses precompiled NSEC3 answers. Without pre-compilation of NSEC3, providing answers that proof the non-existence of a query (NXDOMAIN proof) involve a number of hash-calculations that bog down the performance of the name server. Obviously this precompiled data takes memory but results in NSD 4 answering queries much faster, as it is not CPU-bound by the nsec3 hashing. The precompilation means hashing all the names in the zone, something that takes 60-80 seconds on our measurement machine for the NL zone. In NSD 4 to handle new zone updates quickly, it keeps administration to incrementally update its precompiled NSEC3 data. This means IXFR updates to NSEC3 zones are handled by hashing the names affected by the update and not the entire zone. Note that NSD 4 does not allocate NSEC3 memory for NSEC (non-NSEC3) and unsigned zones, and this could make it use less memory than NSD 3 for non-NSEC3 zones.

If the NL zone was signed with NSEC, with the same key sizes, then the zonefile file would become 2.7 GB for the 5.3 million delegations. The memory usage goes up because there is no opt-out, but goes down because there is no nsec3 administration. The nsd-mem tool calculates 6.0 GB RAM and 10.6 GB disk usage and estimates 7.8 – 16.6 GB. This is nearly identical to the NSEC3 case, slightly less on the RAM and disk usage. NSD 3 uses 4.5 GB (rss) + 4.5 GB (other proc) + about 4 GB (zonec). And omitting the nsdc update usage this is already 13 GB for NSD 3.

Starting the server

In NSD 4 a restart of the daemon should only be necessary for system reasons (kernel updates). With its nsd-control tool you can change the other configuration on the fly without a restart. NSD 3 needed to zonec and restart the daemon to serve a new zone and NSD 4 does not need to do so.

This shows the speed of starting the daemon:

read_speedFor NSD 4 you can compile a new zone without restart, and while serving the old zone. Its zone compiler also has to write the 11 GB nsd.db to disk, and this makes it slower than the NSD 3 zone compiler (it is the same parser). The Knot compiler is likely from before its recent Ragel updates that speed it up. The initial start for NSD 4 measures the time to read the NL zone from the 11 GB nsd.db, this would happen after a system restart for example.

The stop time for NSD 3 and 4 is 0, below one second. For the other daemons this is curiously slow. But these numbers are very small compared to the system start numbers.

Thus, if you get a fresh zonefile and want to start, you can use the left bar for NSD 4, add up the two bars for NSD 3, and add up the two bars for Knot. For a system restart, the daemon start value gives the time needed to setup the daemon memory.

Summary: From NSD 3 to NSD 4

If you are running NSD 3 today and you do not experience any memory issues, such as extensive swapping, during the full serving-updating-zone-compiling cycle you should not experience any problems migrating to NSD 4. This is mainly due to the fact that what a significant fraction of the memory use in NSD 4, is memory-mapped to disk and is not accessed for serving answers to DNS queries.

However, we do advice you to run the nsd-mem tool that ships with NSD 4 to test your actual requirements. That will give you an exact calculation of your core-needs.


The software tested is NSD 4.0.0b5, NSD 3.2.15, Bind 9.9.2-P1, Knot 1.2.0, and Yadifa 1.0.2-2337. The OS is Linux 3.9, the file system is ext4 on hdd.

NSD4 Performance Measurements

by wouter

NSD 4 is currently in beta and we are expecting a release candidate soon. This is the first of a series of blog-posts in which we describe some findings that may help you to optimize your NSD4 installation. The article also serves as an explanation for differences that may show up in various benchmarks.

NSD4 Optimisation

The NSD4 code has been optimised: The latest beta(4.0.0b5) has a couple of optimizations (and beta bug fixes).  We tested the results of our efforts on NLnet Labs’ DISTEL testlab and performed a number of speed measurements.  Several other common open source nameservers are also tested.

A quick view on the results, the figures below show the query load in kqps (1000 of queries per second) for which the different nameserver implementations still manage to answer 100% of all queries. Higher queries rates lead to packets being dropped. Some servers had 99.9% responses on lower qps but then recovered to 100% queries answered at higher query rate. This may be the result of test measurement instability and was ignored.


Similar results as reported by the Knot and Yadifa teams are found, but delving deeper into the performance measurements reveals some subtleties in behavior that bias the results. Knot and Yadifa show very similar or better performance than NSD when they are configured on Linux based servers to use exactly 4 out of 4 cpus. Of course, this is strange and we searched further to look at what caused these outcomes, and it turns out to be related to the number of threads (and processes) and the choice of operating system.

NSD3 has about the same performance as Knot (on Linux). Yadifa is a little faster than Knot. NSD4 is faster than NSD3, and with optimizations in implemented in beta5 even more so.

Knot and Yadifa use a threaded model, where they have threads in one single process that service the DNS requests.  Bind can also be compiled with thread support, which was done here for comparison. It seems that Bind can scale up its performance in both Linux and FreeBSD with more threads, up to 3x more performance. NSD is different to the other implementations in that it uses processes that service the DNS requests instead of threads. This is where operating systems differences start to matter. Operating systems differ in their  threads and processes implementations and in their implementations of the network code.

FreeBSD can increase its packet output when the number of threads is increased, and it can also increase its packet output when the number of processes is increased. However, Linux treats threads and processes very differently, and in both cases, using a number of workers equal to the number of CPU cores is not optimal. The ksoftirqd Linux irq (interrupt) handler uses up the remainder of the four CPU cores when the server uses less than all four cores, likely the implementation of irq handling is a source of measured differences. Interrupts are caused by incoming packets and handling the interrupts from the network card under high load needs a lot of processing power. FreeBSD can push out more packets on the same hardware configuration, with the same software.

The optimal choice of the number of CPU cores to devote to DNS processing depends on the software. On FreeBSD, use as many cores as installed in the system. On Linux, use less than the total number of cores, 2 out of 4 cores for NSD. And 3 out of 4 cores for Yadifa. On Linux, Bind and Knot benefit from using 4 out of 4 cores.

Our Measurements: DISTEL Test Setup

Measurements were carried out using a modified DISTEL testlab setup. The configuration of the DISTEL testlab is with a number of (mostly identical) machines. There is a Player, a Server and a number of replay machines.


The player controls the action, this is scripted. The control is performed over ssh over the control network. The player starts the server software on the server machine, listening on the private LAN. A set of queries is replayed from the replay machines. The resulting replies are captured with tcpdump. Because the PowerEdge 1950 on the server is capable of replying with up to 140.000-160.000 qps on Linux and 220.000 qps on FreeBSD, multiple replay machines are necessary to send and record traffic to the server. Each replay machine sends 1/5 of the query traffic. This adds up to the total qps for the server machine. Test code on the replay machines measures if the actual sent packets correspond with the intended query rate (with timers). The test is run for a fixed time period, so that faster query rates still take the same time period. The maximum qps for this setup is around 430-440 kqps, but the measurements are up to 400kqps. Instability of the outcome seems to increase a little for the higher speeds (esp. for speeds above 350k which cause trouble for some (weaker) replay machines). In any case, the instability is several percent of the response rate percentage.

The detailed graphs for Linux 3.9 (click to enlarge):


The detailed graphs for FreeBSD 9.1 (click to enlarge):


The bar graphs at the beginning of this post are based on these detailed plots, analyzing where 100% responses occur.

The software tested is BIND 9.9.2-P1, NSD 3.2.15, NSD 4.0.0b4, NSD 4.0.0b5, Knot-1.2.0 and Yadifa 1.0.2-2337. The server hardware is a Dell PowerEdge 1950, 2 x 64-bit Intel Xeon CPU 2.0 GHz, thus 4 cores in total, 4 MB cache, 1333 MHz FSB. The Ethernet is the on-board Broadcom NetXtreme II BCM5708 1000Base-T interface. Settings are left at their defaults, if possible. The zone that is loaded is an artificial (test) root zone that contains around 500 delegations, it is not signed with DNSSEC. This zone is an old zone, created before the root was signed (and it is the same zone as previously used for measurements). The order of the queries is random and there are no queries that result in nxdomains.

Future work

In preparation for the release of NSD4 we are measuring the behavior for larger zones in terms of performance and memory usage.

Further Reading:


Wed Sep 25 2013

© Stichting NLnet Labs

Science Park 400, 1098 XH Amsterdam, The Netherlands

labs@nlnetlabs.nl, subsidised by NLnet and SIDN.