Bug 1749 - With harden-referral-path, performance drops abruptly after 20-30 minutes
Status: ASSIGNED
Product: unbound
Classification: Unclassified
Component: server
Version: unspecified
Hardware: x86_64 Linux
Importance: P5 enhancement
Assigned To: unbound team
Reported: 2017-10-15 00:09 CEST by Jacob Hoffman-Andrews
Modified: 2017-11-07 09:42 CET (History)

Description Jacob Hoffman-Andrews 2017-10-15 00:09:37 CEST
I'm using Unbound 1.6.7 with harden-referral-path on RHEL 7.4. I'm load testing it using https://github.com/jsha/go/blob/master/caa-lookups/lookups.go with this command line:

$ caa-lookups -checkA -checkCAA -parallel 500 -timeout 3s -server ub16.crud.net:53 -proto udp > /dev/null < shuffled-names-list.txt

shuffled-names-list.txt is a randomized list of domain names from the Certificate Transparency logs for certificates issued by Let's Encrypt.

I get about 300 qps with this setup. However, after about 20-30 minutes, response time abruptly gets a lot slower. The log contains some messages about unwanted reply threshold and out of memory. Below is an excerpt of the logs from right before the issue until right after. I've also confirmed that disabling harden-referral-path allows the server to run indefinitely without encountering this problem.

Oct 14 21:30:35 unbound[8613:1] info: server stats for thread 1: 37410 queries, 10093 answers from cache, 27317 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 14 21:30:35 unbound[8613:1] info: server stats for thread 1: requestlist max 5529 avg 5305.8 exceeded 0 jostled 0
Oct 14 21:30:35 unbound[8613:1] info: average recursion processing time 3.605590 sec
Oct 14 21:30:35 unbound[8613:1] info: histogram of recursion processing times
Oct 14 21:30:35 unbound[8613:1] info: [25%]=0.139729 median[50%]=0.323079 [75%]=0.650856
Oct 14 21:30:35 unbound[8613:1] info: lower(secs) upper(secs) recursions
Oct 14 21:30:35 unbound[8613:1] info:    0.000000    0.000001 3221
Oct 14 21:30:35 unbound[8613:1] info:    0.004096    0.008192 1
Oct 14 21:30:35 unbound[8613:1] info:    0.016384    0.032768 15
Oct 14 21:30:35 unbound[8613:1] info:    0.032768    0.065536 548
Oct 14 21:30:35 unbound[8613:1] info:    0.065536    0.131072 2693
Oct 14 21:30:35 unbound[8613:1] info:    0.131072    0.262144 5568
Oct 14 21:30:35 unbound[8613:1] info:    0.262144    0.524288 7079
Oct 14 21:30:35 unbound[8613:1] info:    0.524288    1.000000 5308
Oct 14 21:30:35 unbound[8613:1] info:    1.000000    2.000000 1551
Oct 14 21:30:35 unbound[8613:1] info:    2.000000    4.000000 339
Oct 14 21:30:35 unbound[8613:1] info:    4.000000    8.000000 100
Oct 14 21:30:35 unbound[8613:1] info:    8.000000   16.000000 37
Oct 14 21:30:35 unbound[8613:1] info:   16.000000   32.000000 194
Oct 14 21:30:35 unbound[8613:1] info:   32.000000   64.000000 261
Oct 14 21:30:35 unbound[8613:1] info:   64.000000  128.000000 280
Oct 14 21:30:35 unbound[8613:1] info:  128.000000  256.000000 141
Oct 14 21:30:35 unbound[8613:1] info:  256.000000  512.000000 40
Oct 14 21:30:35 unbound[8613:1] info:  512.000000 1024.000000 6
Oct 14 21:30:35 unbound[8613:1] info: 2048.000000 4096.000000 1
Oct 14 21:30:38 unbound[8613:0] info: validation failure <dns2.megahost.ro. A IN>: no signatures from 46.102.235.250
Oct 14 21:30:38 unbound[8613:0] info: validation failure <dns1.megahost.ro. A IN>: no signatures from 46.102.235.250
Oct 14 21:30:39 unbound[8613:1] warning: unwanted reply total reached threshold (10000) you may be under attack. defensive action: clearing the cache
Oct 14 21:31:22 unbound[8613:0] info: validation failure <ccp-dreux.fr. A IN>: No DNSKEY record for key ccp-dreux.fr. while building chain of trust
Oct 14 21:31:22 unbound[8613:0] info: validation failure <ccp-dreux.fr. NS IN>: No DNSKEY record for key ccp-dreux.fr. while building chain of trust
Oct 14 21:31:25 unbound[8613:0] info: server stats for thread 0: 15603 queries, 3788 answers from cache, 11815 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 14 21:31:25 unbound[8613:0] info: server stats for thread 0: requestlist max 9863 avg 5551.3 exceeded 0 jostled 0
Oct 14 21:31:25 unbound[8613:0] info: average recursion processing time 7.466609 sec
Oct 14 21:31:25 unbound[8613:0] info: histogram of recursion processing times
Oct 14 21:31:25 unbound[8613:0] info: [25%]=0.135983 median[50%]=0.337766 [75%]=0.854599
Oct 14 21:31:25 unbound[8613:0] info: lower(secs) upper(secs) recursions
Oct 14 21:31:25 unbound[8613:0] info:    0.000000    0.000001 1223
Oct 14 21:31:25 unbound[8613:0] info:    0.008192    0.016384 2
Oct 14 21:31:25 unbound[8613:0] info:    0.016384    0.032768 10
Oct 14 21:31:25 unbound[8613:0] info:    0.032768    0.065536 392
Oct 14 21:31:25 unbound[8613:0] info:    0.065536    0.131072 1223
Oct 14 21:31:25 unbound[8613:0] info:    0.131072    0.262144 2242
Oct 14 21:31:25 unbound[8613:0] info:    0.262144    0.524288 2690
Oct 14 21:31:25 unbound[8613:0] info:    0.524288    1.000000 1469
Oct 14 21:31:25 unbound[8613:0] info:    1.000000    2.000000 345
Oct 14 21:31:25 unbound[8613:0] info:    2.000000    4.000000 173
Oct 14 21:31:25 unbound[8613:0] info:    4.000000    8.000000 190
Oct 14 21:31:25 unbound[8613:0] info:    8.000000   16.000000 565
Oct 14 21:31:25 unbound[8613:0] info:   16.000000   32.000000 696
Oct 14 21:31:25 unbound[8613:0] info:   32.000000   64.000000 159
Oct 14 21:31:25 unbound[8613:0] info:   64.000000  128.000000 212
Oct 14 21:31:25 unbound[8613:0] info:  128.000000  256.000000 110
Oct 14 21:31:25 unbound[8613:0] info:  256.000000  512.000000 27
Oct 14 21:31:25 unbound[8613:0] info:  512.000000 1024.000000 8
Oct 14 21:31:27 unbound[8613:0] info: validation failure <www.guildcraft.nl. A IN>: No DNSKEY record for key guildcraft.nl. while building chain of trust
Oct 14 21:31:38 unbound[8613:1] info: server stats for thread 1: 5056 queries, 718 answers from cache, 4338 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 14 21:31:38 unbound[8613:1] info: server stats for thread 1: requestlist max 8934 avg 6366.57 exceeded 0 jostled 0
Oct 14 21:31:38 unbound[8613:1] info: average recursion processing time 15.841820 sec
Oct 14 21:31:38 unbound[8613:1] info: histogram of recursion processing times
Oct 14 21:31:38 unbound[8613:1] info: [25%]=0.382482 median[50%]=6.65974 [75%]=14.3008
Oct 14 21:31:38 unbound[8613:1] info: lower(secs) upper(secs) recursions
Oct 14 21:31:38 unbound[8613:1] info:    0.000000    0.000001 208
Oct 14 21:31:38 unbound[8613:1] info:    0.016384    0.032768 1
Oct 14 21:31:38 unbound[8613:1] info:    0.032768    0.065536 45
Oct 14 21:31:38 unbound[8613:1] info:    0.065536    0.131072 185
Oct 14 21:31:38 unbound[8613:1] info:    0.131072    0.262144 404
Oct 14 21:31:38 unbound[8613:1] info:    0.262144    0.524288 464
Oct 14 21:31:38 unbound[8613:1] info:    0.524288    1.000000 329
Oct 14 21:31:38 unbound[8613:1] info:    1.000000    2.000000 74
Oct 14 21:31:38 unbound[8613:1] info:    2.000000    4.000000 146
Oct 14 21:31:38 unbound[8613:1] info:    4.000000    8.000000 385
Oct 14 21:31:38 unbound[8613:1] info:    8.000000   16.000000 1177
Oct 14 21:31:38 unbound[8613:1] info:   16.000000   32.000000 438
Oct 14 21:31:38 unbound[8613:1] info:   32.000000   64.000000 136
Oct 14 21:31:38 unbound[8613:1] info:   64.000000  128.000000 134
Oct 14 21:31:38 unbound[8613:1] info:  128.000000  256.000000 72
Oct 14 21:31:38 unbound[8613:1] info:  256.000000  512.000000 21
Oct 14 21:31:38 unbound[8613:1] info:  512.000000 1024.000000 5
Oct 14 21:31:50 unbound[8613:1] error: Could not generate request: out of memory
Oct 14 21:31:50 unbound[8613:1] error: mem error generating DS request
Oct 14 21:31:50 unbound[8613:1] error: Could not generate request: out of memory
Oct 14 21:31:50 unbound[8613:1] error: mem error generating DS request
Oct 14 21:31:56 unbound[8613:0] warning: unwanted reply total reached threshold (10000) you may be under attack. defensive action: clearing the cache
Oct 14 21:32:25 unbound[8613:0] info: validation failure <ns2.bofh.lv. A IN>: signature expired from 91.224.1.10 for key bofh.lv. while building chain of trust
Oct 14 21:32:26 unbound[8613:0] info: server stats for thread 0: 2600 queries, 0 answers from cache, 2600 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Oct 14 21:32:26 unbound[8613:0] info: server stats for thread 0: requestlist max 8709 avg 6600.49 exceeded 0 jostled 0
Oct 14 21:32:26 unbound[8613:0] info: average recursion processing time 30.815923 sec
Oct 14 21:32:26 unbound[8613:0] info: histogram of recursion processing times
Oct 14 21:32:26 unbound[8613:0] info: [25%]=9.54277 median[50%]=13.6219 [75%]=23.5955
Oct 14 21:32:26 unbound[8613:0] info: lower(secs) upper(secs) recursions
Oct 14 21:32:26 unbound[8613:0] info:    0.262144    0.524288 2
Oct 14 21:32:26 unbound[8613:0] info:    0.524288    1.000000 3
Oct 14 21:32:26 unbound[8613:0] info:    1.000000    2.000000 8
Oct 14 21:32:26 unbound[8613:0] info:    2.000000    4.000000 120
Oct 14 21:32:26 unbound[8613:0] info:    4.000000    8.000000 308
Oct 14 21:32:26 unbound[8613:0] info:    8.000000   16.000000 1391
Oct 14 21:32:26 unbound[8613:0] info:   16.000000   32.000000 623
Oct 14 21:32:26 unbound[8613:0] info:   32.000000   64.000000 80
Oct 14 21:32:26 unbound[8613:0] info:   64.000000  128.000000 155
Oct 14 21:32:26 unbound[8613:0] info:  128.000000  256.000000 107
Oct 14 21:32:26 unbound[8613:0] info:  256.000000  512.000000 28
Oct 14 21:32:26 unbound[8613:0] info:  512.000000 1024.000000 11
Oct 14 21:32:26 unbound[8613:0] info: 1024.000000 2048.000000 1


unbound -V output:

Version 1.6.6
linked libs: libevent 2.0.21-stable (it uses epoll), OpenSSL 1.0.1e-fips 11 Feb 2013
linked modules: dns64 ipsecmod subnetcache respip validator iterator
Comment 1 Wouter Wijngaards 2017-10-16 09:16:10 CEST
Hi Jacob,

I think all of this is caused by the out-of-memory condition.  That is probably also why it loses the context for outstanding queries and therefore categorizes the replies as unwanted.  A counter that high is very unusual: port randomisation opens only a small set of ports on the machine, so even if a large flood is directed at it, only some replies get counted.

Does the config for that server specify more cache than the machine has memory?  Or are you running (e.g. OpenBSD) with a very low ulimit for memory?  Other people have configured GBs of cache while ulimit allows only 256M of heap, and they get this kind of problem.  Adjust the config and ulimit values to fit.

Sorting out the out-of-memory condition will likely also increase your qps, because queries get answered instead of dropped and requeried.

Best regards, Wouter
Comment 2 Jacob Hoffman-Andrews 2017-10-16 19:07:21 CEST
Whoops, I meant to include my config. Here it is. Your answer about out of memory triggering the problem makes sense. I have a very small cache-max-ttl, and a correspondingly small cache size, but I'm guessing that harden-referral-path increases the memory pressure at a given cache-max-ttl.

server:
    harden-referral-path: yes
    cache-max-ttl: 60
    cache-min-ttl: 0
    do-ip4: yes
    do-ip6: yes
    do-not-query-localhost: yes
    do-tcp: yes
    do-udp: yes
    extended-statistics: yes
    harden-dnssec-stripped: yes
    harden-glue: yes
    hide-identity: yes
    hide-version: yes
    use-caps-for-id: yes
    val-clean-additional: yes
    prefetch: no
    incoming-num-tcp: 2000
    infra-cache-slabs: 8
    jostle-timeout: 1000
    key-cache-slabs: 8
    log-time-ascii: yes
    msg-cache-size: 128m
    msg-cache-slabs: 8
    neg-cache-size: 0
    num-queries-per-thread: 6000
    num-threads: 2
    outgoing-num-tcp: 6000
    outgoing-range: 48000
    rrset-cache-size: 256m
    rrset-cache-slabs: 8
    so-rcvbuf: 4m
    so-sndbuf: 4m
    statistics-interval: 60
    unwanted-reply-threshold: 10000
    username: "unbound"
    auto-trust-anchor-file: "/var/lib/unbound/root.key"
    root-hints: "/var/lib/unbound/root.hints"
    chroot: ""
    directory: "/etc/unbound"
    trusted-keys-file: /etc/unbound/keys.d/*.key
    interface: ::0
    interface: 0.0.0.0
    access-control: 192.241.206.144/32 allow
    logfile: "/tmp/unbound.log"
    verbosity: 1
Comment 3 Jacob Hoffman-Andrews 2017-10-17 00:59:42 CEST
This is on a machine with 4GB of RAM. I checked the ulimits, and max memory size is unlimited. I *think* the config I posted above specifies 384MB of cache:

    msg-cache-size: 128m
    rrset-cache-size: 256m

Unless those are multiplied by the number of slabs? I've been assuming that amount of memory is divided up among the slabs.
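(For what it's worth, the configured totals can be summed with a quick shell check; this is a sketch assuming the cache-size values use the "m" suffix as in the config above, and that slab counts partition a cache's configured size rather than multiplying it — worth verifying against the unbound.conf man page.)

```shell
# Sum the cache-size settings from an unbound.conf-style fragment.
# /tmp/ub-cache.conf stands in for the real config file.
cat > /tmp/ub-cache.conf <<'EOF'
    msg-cache-size: 128m
    rrset-cache-size: 256m
EOF

# Strip the "m" suffix and add the values up (result in megabytes).
total_mb=$(awk '/cache-size:/ { gsub(/m/, "", $2); sum += $2 } END { print sum }' /tmp/ub-cache.conf)
echo "configured cache total: ${total_mb}MB"
```

Actual resident memory will be higher than this sum, since it excludes the infra, key, and negative caches plus malloc overhead.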

Part of the reason I think this is more than just an "out of memory" issue is that (a) it only occurs with harden-referral-path, which is marked experimental, and (b) once it occurs, even if I stop the load generator, Unbound never recovers to the point where it can serve queries in a reasonable amount of time.


# free
              total        used        free      shared  buff/cache   available
Mem:        3881792     1122448     2380136      195092      379208     2330036
Swap:             0           0           0


$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 15001
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Comment 4 Wouter Wijngaards 2017-10-17 09:22:11 CEST
Hi Jacob,

Yes, then you must not actually be out of memory.

I think it is getting into a loop between the DS fetch and the NS fetch.  This would fix that bug, does it work?

Best regards, Wouter

Index: iterator/iterator.c
===================================================================
--- iterator/iterator.c	(revision 4371)
+++ iterator/iterator.c	(working copy)
@@ -914,6 +914,9 @@
 		generate_a_aaaa_check(qstate, iq, id);
 		return;
 	}
+	/* no need to get the NS record for DS, it is above the zonecut */
+	if(qstate->qinfo.qtype == LDNS_RR_TYPE_DS)
+		return;
 
 	log_nametypeclass(VERB_ALGO, "schedule ns fetch", 
 		iq->dp->name, LDNS_RR_TYPE_NS, iq->qchase.qclass);
Comment 5 Jacob Hoffman-Andrews 2017-10-19 04:34:04 CEST
With that patch, Unbound lasts twice as long (~2 hours instead of ~1 hour) under the same load, but eventually fails again with the same symptoms. Example log message:

Oct 18 23:13:35 unbound[31286:1] error: Could not generate request: out of memory
Oct 18 23:13:35 unbound[31286:1] error: mem error generating DS request

There are a bunch of those, and also a couple minutes later:

Oct 18 23:14:36 unbound[31286:1] error: internal error: looping module (iterator) stopped
Oct 18 23:14:36 unbound[31286:1] info: pass error for qstate reaprenda.com. NS IN
Oct 18 23:14:37 unbound[31286:0] error: internal error: looping module (iterator) stopped
Oct 18 23:14:37 unbound[31286:0] info: pass error for qstate 2013qqsf.com. NS IN

Leaving out some stats and other "normal" logs but let me know if you'd like a full log of the time around the failure.
Comment 6 Wouter Wijngaards 2017-10-19 09:12:56 CEST
Hi Jacob,

Yes, I would like to get full logs (set verbosity 4, with unbound-control for example).  If you run unbound-control verbosity 4 when the errors start happening, some logs covering the mem-error and looping errors would be nice to have.

Best regards, Wouter
Comment 7 Jacob Hoffman-Andrews 2017-10-19 21:46:21 CEST
Here's a full log from another run, where I set verbosity to 4 as soon as I saw a memory error. Note that at that time Unbound was still serving most responses in a timely manner. It was around 17:20 that performance abruptly fell off a cliff.

Note: 745MB download.

https://s3-us-west-2.amazonaws.com/unbound-ticket/unbound-short.log
Comment 8 Jacob Hoffman-Andrews 2017-11-06 21:01:11 CET
Any thoughts on the above logs? As an additional piece of data, we at Let's Encrypt recently tried to upgrade Unbound to 1.6.6 *without* harden-referral-path (just using our same old config), and we found that it failed after a few hours, with "looping module stopped" showing up in the logs.
Comment 9 Wouter Wijngaards 2017-11-07 09:33:48 CET
Hi Jacob,

I don't see a connection to "looping module stopped"; the string 'looping module' does not appear in the log.  But I think I can see the problem.

What is happening is that there are over 1000 messages outstanding, and thousands of requests (5456) are waiting to get serviced.  Replies return very slowly or not at all; this is evidenced by timeouts and long roundtrip estimates.  The long roundtrips are probably caused by unbound not managing to service the replies in time (no CPU to service them).

My guess is that the machine has received too many queries to process, and the requestlist has now grown too long; unbound cannot handle this amount of work at the same time.

Your config has
    num-queries-per-thread: 6000
But unbound cannot handle that many.  What you could do is lower this, to perhaps 1000.  Queries then get jostled out and dropped (you can see that in the statistics), but unbound keeps performing well while doing so.  The client sending to unbound then has to slow down (time out and wait).
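As a config fragment, the suggested change would look like this (1000 is the ballpark value from above, not a tuned number):

    server:
        num-queries-per-thread: 1000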

You could also increase the horsepower of that computer, maybe with more cores, and then set num-threads higher.

Best regards, Wouter
Comment 10 Wouter Wijngaards 2017-11-07 09:42:52 CET
Hi Jacob,

Something else from the logs I wanted to show: around 17:20 the measured roundtrip is very high:
Oct 19 17:21:30 unbound[660:0] debug: measured roundtrip at 32464 msec

grep -e "measured roundtrip" unbound-short.log shows you how that value goes from high to very high around 17:20.
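That grep can be extended to pull out just the worst value; a sketch, using a small stand-in file with two example lines in the log format shown above:

```shell
# /tmp/rtt-sample.log is a stand-in for unbound-short.log.
cat > /tmp/rtt-sample.log <<'EOF'
Oct 19 17:20:58 unbound[660:0] debug: measured roundtrip at 812 msec
Oct 19 17:21:30 unbound[660:0] debug: measured roundtrip at 32464 msec
EOF

# The msec value is the next-to-last field on each matching line.
worst=$(grep -e "measured roundtrip" /tmp/rtt-sample.log \
  | awk '{ if ($(NF-1) + 0 > m) m = $(NF-1) + 0 } END { print m }')
echo "worst measured roundtrip: ${worst} msec"
```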

That is a 32-second roundtrip.  Somehow unbound has been stalled, or had no CPU to process anything, perhaps because the system is overloaded in some way.  I guess actual ping times are more like a fraction of a second.

Best regards, Wouter