Bug 1445

Summary: AXFR hangs in refreshing state
Product: NSD Reporter: mail
Component: NSD CodeAssignee: NSD team <nsd-team>
Status: ASSIGNED ---    
Severity: normal CC: jared, wouter
Priority: P5    
Version: 4.1.x   
Hardware: x86_64   
OS: OpenBSD   

Description mail 2017-09-14 10:18:45 CEST
This is on OpenBSD 6.1-stable with NSD version: 4.1.15

Easily reproduceable, after NSD restart, master server (with hundreds of zones) is sending (a lot of) notifies, which are received by slave and result in
refreshing of zones via AXFR. (Anonymized) log shows the received notify:

Sep 14 09:23:17 nsd-slavehost nsd[58722]: notify for myzone.tld. from 172.31.65.240 serial 2017091401

However, slave AXFR is HANGING in "refreshing state", like this:

$ sudo nsd-control zonestatus myzone.tld
zone:   myzone.tld
        state: refreshing
        served-serial: "2017091202 since 2017-09-14T07:34:58"
        commit-serial: "2017091202 since 2017-09-14T07:34:58"

After next triggered (SOA retry) or manual transfer everything is fine again:
$ sudo nsd-control transfer myzone.tld   
ok

$ sudo nsd-control zonestatus myzone.tld 
zone:   myzone.tld
        state: refreshing
        served-serial: "2017091202 since 2017-09-14T07:34:58"
        commit-serial: "2017091202 since 2017-09-14T07:34:58"
        notified-serial: "0 since 2017-09-14T09:31:02"
        transfer: "TCP connected to 172.31.65.240"

$ sudo nsd-control zonestatus myzone.tld 
zone:   myzone.tld
        state: ok
        served-serial: "2017091401 since 2017-09-14T09:31:10"
        commit-serial: "2017091401 since 2017-09-14T09:31:10"

I think this hanging might be related to earlier reported bug #817.
Comment 1 Wouter Wijngaards 2017-09-14 10:53:47 CEST
Hi Mail,

I think this may be because the SOA refresh and retry times are very large.  Attempts to perform AXFR are failing, and NSD is waiting before retry.

Pick a lower retry value in the SOA.  Or use this option in nsd.conf:
max-retry-time: 120
that limits the retry time to 2 minutes.

This patch shows you how long NSD is waiting:
Index: remote.c
===================================================================
--- remote.c	(revision 4776)
+++ remote.c	(working copy)
@@ -992,6 +992,10 @@
 		if(!print_soa_status(ssl, "notified-serial", &xz->soa_notified,
 			xz->soa_notified_acquired))
 			return 0;
+	} else if(xz->event_added) {
+		if(!ssl_printf(ssl, "\twait: \"%u sec for next attempt\"\n",
+			(unsigned)xz->timeout.tv_sec))
+			return 0;
 	}
 
 	/* UDP */


And the output looks like this:
zone:	example.com
	state: refreshing
	served-serial: "1379078166 since 2017-09-14T10:49:22"
	commit-serial: "1379078166 since 2017-09-14T10:52:49"
	wait: "12417 sec for next attempt"

Do you think this change is useful in the nsd-control output?

Best regards, Wouter
Comment 2 Wouter Wijngaards 2017-09-14 10:57:18 CEST
Hi Mail,

What may also be causing this is that now, NSD will not try 3 times every IP address of the master, but just once before waiting, this was a change a couple versions ago.  The master is very busy and maybe a retry straight away can fix this.  You can make this happen by listing the master's IP address twice (or three times) in the nsd.conf part, by repeating the request-xfr: 192.0.2.1 NOKEY lines several times.  This then causes the slave to attempt to transfer from that IP address several times instead of waiting for the retry timeout after a transfer failure.

Best regards, Wouter
Comment 3 mail 2017-09-25 21:48:40 CEST
(In reply to Wouter Wijngaards from comment #1)

> I think this may be because the SOA refresh and retry times are very large. 
> Attempts to perform AXFR are failing, and NSD is waiting before retry.

But why does the AXFR fail reliable?
 
> Pick a lower retry value in the SOA.  Or use this option in nsd.conf:
> max-retry-time: 120
> that limits the retry time to 2 minutes.

That is a bad workaround with a lot of zones. 

> This patch shows you how long NSD is waiting:
> Index: remote.c
> ===================================================================
> --- remote.c	(revision 4776)
> +++ remote.c	(working copy)
> @@ -992,6 +992,10 @@
>  		if(!print_soa_status(ssl, "notified-serial", &xz->soa_notified,
>  			xz->soa_notified_acquired))
>  			return 0;
> +	} else if(xz->event_added) {
> +		if(!ssl_printf(ssl, "\twait: \"%u sec for next attempt\"\n",
> +			(unsigned)xz->timeout.tv_sec))
> +			return 0;
>  	}
>  
>  	/* UDP */
> 
> 
> And the output looks like this:
> zone:	example.com
> 	state: refreshing
> 	served-serial: "1379078166 since 2017-09-14T10:49:22"
> 	commit-serial: "1379078166 since 2017-09-14T10:52:49"
> 	wait: "12417 sec for next attempt"
> 
> Do you think this change is useful in the nsd-control output?

While not solving the issue, the output *is* useful and should be added.
Comment 4 mail 2017-09-25 21:53:35 CEST
(In reply to Wouter Wijngaards from comment #2)
 
> What may also be causing this is that now, NSD will not try 3 times every IP
> address of the master, but just once before waiting, this was a change a
> couple versions ago. 

Maybe the cause, for not noticing earlier, yes.  However, the real question is: why did it need three attempts and fails on the first? 

> The master is very busy and maybe a retry straight
> away can fix this.  You can make this happen by listing the master's IP
> address twice (or three times) in the nsd.conf part, by repeating the
> request-xfr: 192.0.2.1 NOKEY lines several times.  This then causes the
> slave to attempt to transfer from that IP address several times instead of
> waiting for the retry timeout after a transfer failure.

I know, but this is still a workaround, not a solution for the actual problem.
I can easily reproduce this *reliable*. It _always_ "fails" on the first(?) attempt after NSD restart (note: not reload).
Comment 5 Wouter Wijngaards 2017-10-02 13:53:16 CEST
Hi Mail,

I have committed that change so the output looks nicer (reworded as, 'time between attempts', because that is precisely what the time value is).  That helps the output but does not fix the problem.

Since you can reproduce every time, I would like to find out what is causing it.  Can you tell me what NSD gives as the failure for the transfer (with higher verbosity -V 5  (or even -F -1 -L 2 added for debug output).  It would be interesting to see what  happens.

Another way to get a clue, is if it works with the 'multiple xfr entries in config', and does not work with one.  So then we would know that the first fails, but the second succeeds.  (and then I would like to find the difference between the first and second attempts?).

Best regards, Wouter
Comment 6 Jared Mauch 2017-11-13 20:39:25 CET
I've seen some similar issues on my nameserver where zones are not refreshing:

% nsd-control zonestatus | grep 'refreshing' | wc -l
2146

Is there a way to limit the time spent refreshing or better control the quotas for refreshes?

At present I have 11k zones:

hostname:~$ nsd-control stats_noreset
...
zone.master=70
zone.slave=11281

Can I permit more than 40 zone transfers at once?
Comment 7 Wouter Wijngaards 2017-11-13 22:19:04 CET
Hi,

Perhaps this should be configurable, this value in xfrd.h is now in code, and set at 32, if you set it higher you enable more concurrent zone transfers.

in xfrd.h:242
#define XFRD_MAX_TCP 32

change that to 320 and then you have a lot more zone transfer ability.  Does that solve your problem?  If so, it should either be configurable or have a higher default.

Best regards, Wouter
Comment 8 Wouter Wijngaards 2017-11-14 09:15:12 CET
Hi,

The code base contains patch #2871, and this means the limits are increased, you could perhaps try that (similar to increasing that xfrd.h #define).  That may solve the problem(s) for you?

Best regards, Wouter