Bug 1132 - NSD doesn't honour refresh timer for SERVFAIL zones
NSD doesn't honour refresh timer for SERVFAIL zones
Status: RESOLVED FIXED
Product: NSD
Classification: Unclassified
Component: NSD Code
4.1.x
All All
: P5 normal
Assigned To: NSD team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2016-10-20 18:23 CEST by Anand Buddhdev
Modified: 2016-10-26 08:56 CEST (History)
1 user (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Anand Buddhdev 2016-10-20 18:23:30 CEST
When NSD has a slave zone configured, NSD keeps refresh timer information in memory, and writes it to xfrd.state when exiting. Upon restart, it reads xfrd.state, so that it doesn't have to needlessly query master servers until a zone's refresh timer actually expires.

However, support NSD is slaving a zone, whose master responds with SERVFAIL. NSD keeps querying the master, with increasingly larger refresh periods. Upon exiting, NSD also records this timer information in xfrd.state. However, when starting up again, NSD appears to ignore this timer information, and queries the master immediately for such a zone.

If an NSD server is configured with lots of slave zones, and several of these get SERVFAIL responses from the master, then NSD hammers the master server with many XFR attempts to start with, and then several more, until the refresh intervals become large.

This is, in my opinion, a bug. NSD should treat a SERVFAILing zone the same as any other. It should not forget the refresh timer upon restart. It should read the refresh timer for such a zone, and only attempt to refresh it upon expiry of the timer. This allows NSD to be polite towards a master server, and not hammer it upon startup. As an operator with lots of slave zones, this is important to me.
Comment 1 Wouter Wijngaards 2016-10-21 09:03:01 CEST
Hi Anand,

This sounds reasonable.

I believe (at first look) that this is likely caused by the fact that failing zones use the retry timer (not the refresh timer).

Best regards, Wouter
Comment 2 Wouter Wijngaards 2016-10-25 11:31:55 CEST
Hi Anand,

There is a fix in the repository that for SERVFAIL zones performs backoff, and remembers the timeout on next startup.

Best regards, Wouter
Comment 3 Anand Buddhdev 2016-10-25 13:48:33 CEST
Hi Wouter. I picked up the patch from the repo, and applied it to 4.1.13, and then tested on our test server. On the test server, I have 3 zones configured, two of which are fine but one is SERVFAILing. However, I've noticed that upon restart of NSD, it still queries the master for the SERVFAILing zone. Here's the log showing this:

[2016-10-25 11:44:08.292] nsd[2204]: warning: signal received, shutting down...
[2016-10-25 11:44:08.314] nsd[2208]: notice: nsd starting (NSD 4.1.13)
[2016-10-25 11:44:08.318] nsd[2208]: info: setup SSL certificates
[2016-10-25 11:44:08.364] nsd[2209]: info: zonefile 61.195.in-addr.arpa.zone does not exist
[2016-10-25 11:44:08.365] nsd[2209]: info: zone nro.net. read with success
[2016-10-25 11:44:08.365] nsd[2209]: info: zone nro.org. read with success
[2016-10-25 11:44:08.365] nsd[2209]: notice: nsd started (NSD 4.1.13), pid 2208
[2016-10-25 11:44:08.366] nsd[2208]: error: xfrd: zone 61.195.in-addr.arpa. received error code SERVFAIL from 193.0.19.190
[2016-10-25 11:44:08.367] nsd[2208]: error: xfrd: zone 61.195.in-addr.arpa. received error code SERVFAIL from 193.0.19.190
[2016-10-25 11:44:08.368] nsd[2208]: error: xfrd: zone 61.195.in-addr.arpa. received error code SERVFAIL from 193.0.19.190
Comment 4 Wouter Wijngaards 2016-10-25 13:50:44 CEST
Hi Anand,

This is the second restart (not the first)?  It does accurately stop querying after a while (the backoff) ?  So it loses the backoff?

Best regards, Wouter
Comment 5 Wouter Wijngaards 2016-10-25 13:55:10 CEST
Hi Anand,

So the three attempts you see in the log you show are the first volley.  Then the timer gets applied.  (Yes it tries masters multiple times, but it is really a good idea to do that).  That timer should not do the exponential backoff, right? ...  Maybe I need to better reproduce this?

Best regards, Wouter
Comment 6 Anand Buddhdev 2016-10-25 14:08:18 CEST
Sorry about not being clear. Let me clarify.

1. First, I start with 3 zones configured. There are no zone files on disk, and no xfrd.state file.

2. NSD AXFRs two zones, and fails on the 3rd one.

3. I run "nsd-control write" to flush the 2 good zones to disk.

4. I let NSD run for a while, and notice it starting to back-off for the failing zone.

5. Now, I restart NSD. It writes out xfrd.state, exits, and starts up again.

6. It reads in the 2 good zones from disk, and honours their refresh timer. However, for the zone that was failing, it immediately queries tor master, rather than waiting for the timer.


I have some comments on this process:

1. I still disagree with NSD's 3x attempt at AXFR. It is pointless. If a master gives a REFUSED/SERVFAIL response, that's hardly going to change in the very next millisecond. If an NSD server has a few hundred failing zones, it's going to make things worse for the master, by making 3x the TCP/XFR attempts.

2. NSD should distinguish between a refresh of a good zone, and a failing zone. Let me expand on this.

Let's say you have 2 slave zones configured. You start NSD, and one zone is XFRed successfully, but the second one fails. Let's also assume that the good zone has not yet bene flushed to disk. Now, NSD is restarted.

At this point, it is reasonable for NSD to think:

1. The zone that it had been able to XFR successfully before, but isn't on disk, should be XFRed again. It is likely to succeed, and we want a fresh copy of it immediately.

2. The zone that was SEVFAILing, is likely to keep SERVFAILing. Don't bother the master server for it immediately, but keep using the timers that we recorded at exit previously. This is especially important, because for a SERVFAILing zone, NSD is going to hammer the master with not one, but THREE attempts. NSD, in the face of a failing XFR, makes matters worse by being even more aggressive towards the master. This is why I also disagree with you strongly about the triple attempt at XFR.
Comment 7 Wouter Wijngaards 2016-10-25 14:34:15 CEST
Hi Anand,

Okay, set the MAX_ROUNDS to 1, NSD tries every master once.  Then waits for timeout.

Also, fixed, it'll save the 'backoff' value to xfrd.state.  This value is also read in again.  For older xfrd.state files without this value, it'll ignore the xfrd.state file, and write a new version of the file upon exit.

Does this version work the way you want?

Best regards, Wouter
Comment 8 Anand Buddhdev 2016-10-25 15:42:08 CEST
Hi Wouter,

Thanks for setting MAX_ROUNDS to 1. That's a great improvement.

I see that "backoff" is being recorded in xfrd.state. However, NSD doesn't appear to be using it. Upon every restart, NSD is still querying the master for a failing zone. Here's the log:


[2016-10-25 13:36:48.340] nsd[22830]: notice: nsd starting (NSD 4.1.13)
[2016-10-25 13:36:48.344] nsd[22830]: info: setup SSL certificates
[2016-10-25 13:36:48.389] nsd[22831]: info: zonefile 61.195.in-addr.arpa.zone does not exist
[2016-10-25 13:36:48.390] nsd[22831]: info: zone nro.net. read with success
[2016-10-25 13:36:48.390] nsd[22831]: info: zone nro.org. read with success
[2016-10-25 13:36:48.390] nsd[22831]: notice: nsd started (NSD 4.1.13), pid 22830
[2016-10-25 13:36:48.391] nsd[22830]: error: xfrd: zone 61.195.in-addr.arpa. received error code SERVFAIL from 193.0.19.190


[2016-10-25 13:37:26.051] nsd[22831]: warning: signal received, shutting down...
[2016-10-25 13:37:26.073] nsd[23302]: notice: nsd starting (NSD 4.1.13)
[2016-10-25 13:37:26.077] nsd[23302]: info: setup SSL certificates
[2016-10-25 13:37:26.122] nsd[23303]: info: zonefile 61.195.in-addr.arpa.zone does not exist
[2016-10-25 13:37:26.123] nsd[23303]: info: zone nro.net. read with success
[2016-10-25 13:37:26.123] nsd[23303]: info: zone nro.org. read with success
[2016-10-25 13:37:26.123] nsd[23303]: notice: nsd started (NSD 4.1.13), pid 23302
[2016-10-25 13:37:26.124] nsd[23302]: error: xfrd: zone 61.195.in-addr.arpa. received error code SERVFAIL from 193.0.19.190


[2016-10-25 13:37:58.889] nsd[23303]: warning: signal received, shutting down...
[2016-10-25 13:37:58.908] nsd[23320]: notice: nsd starting (NSD 4.1.13)
[2016-10-25 13:37:58.912] nsd[23320]: info: setup SSL certificates
[2016-10-25 13:37:58.958] nsd[23321]: info: zonefile 61.195.in-addr.arpa.zone does not exist
[2016-10-25 13:37:58.958] nsd[23321]: info: zone nro.net. read with success
[2016-10-25 13:37:58.959] nsd[23321]: info: zone nro.org. read with success
[2016-10-25 13:37:58.959] nsd[23321]: notice: nsd started (NSD 4.1.13), pid 23320
[2016-10-25 13:37:58.960] nsd[23320]: error: xfrd: zone 61.195.in-addr.arpa. received error code SERVFAIL from 193.0.19.190
Comment 9 Wouter Wijngaards 2016-10-25 16:03:05 CEST
Hi Anand,

Ah yes, for zones with no content.  Fixed that!
(And some smaller issues with axfr fallback with NUM_ROUNDS=1).

I think it should work now.  I stopped it from refreshing after reading xfrd.state, but then another check after it (for notifies) saw no content and started a transfer as well.  it omits that for zones with no content now.

Best regards, Wouter
Comment 10 Anand Buddhdev 2016-10-25 17:16:23 CEST
Thanks Wouter! I confirm that your latest set of commits now work as intended :)

Thank you for all these fixes. I look forward to a release of 4.1.14 to take advantage of these new features.
Comment 11 Wouter Wijngaards 2016-10-26 08:56:21 CEST
Hi Anand,

Good to hear!  I'll close this report, then.

Best regards, Wouter