Bug 817 - xfrd update failed loop - reload related race
xfrd update failed loop - reload related race
Status: NEW
Product: NSD
Classification: Unclassified
Component: NSD Code
4.1.x
x86_64 OpenBSD
: P5 normal
Assigned To: NSD team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2016-08-10 12:23 CEST by mail
Modified: 2016-12-19 09:43 CET (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description mail 2016-08-10 12:23:35 CEST
We have a large environment with a lot of Active Directory DNS master servers
where we slave a (lot of) zone(s) from to our NSD. So the NSD receives several 
notifies per second and attempts several transfers every few seconds and has in
general a lot of throughput to handle.

Sometimes IXFR is failing and it does fallback to AXFR, working fine.

However, the post transfer reload seems to also fail sometimes, e.g. in the log 
we have the following (anonymized myhostname/myzone):

Aug 10 02:52:18 myhostname nsd[17775]: xfrd: zone myzone.tld: needs to be loaded. reload lost? try again

We have no idea WHY the reload is lost at all. Maybe some limit is reached on
the machine?

Fine for now, but unfortunately afterwards, xfrd enters a "endless loop" of the
following messages:

Aug 10 02:52:30 myhostname nsd[17775]: xfrd: zone myzone.tld committed "received update to serial 9297248 at 2016-08-10T02:52:30 from 10.32.208.10"
Aug 10 02:52:30 myhostname nsd[17775]: xfrd: zone myzone.tld: soa serial 9297248 update failed, restarting transfer (notified zone)
Aug 10 02:53:03 myhostname nsd[17775]: xfrd: zone myzone.tld committed "received update to serial 9297249 at 2016-08-10T02:53:03 from 10.32.208.10"
Aug 10 02:53:03 myhostname nsd[17775]: xfrd: zone myzone.tld: soa serial 9297249 update failed, restarting transfer (notified zone)
[...]

...until the zone expires :(

A manual nsd-control reload myzone.tld does NOT fix the looping xfrd.
A manual nsd-control transfer myzone.tld does NOT fix the looping xfrd.

A manual nsd-control force_transfer myzone.tld does FIX it, until next same lost 
reload and loop (so this is our current hotfix/workaround: an hourly cron-job
which calls force_transfer for all zones).

So, all looks like there is some race between the master triggering the reload
after transfers and the xfrd process.  

We run NSD 4.1.7 on OpenBSD 5.9 (from base) and our config is simple and 
looks like this:

server:
        server-count: 2
        hide-version: yes
        verbosity: 1
        database: "" # disable database

remote-control:
        control-enable: yes

zone:
       name: myzone.tld
       zonefile: slave/%s
       allow-notify: 1.1.1.1 NOKEY
       request-xfr: 1.1.1.1 NOKEY
       allow-notify: 1.1.1.2 NOKEY
       request-xfr: 1.1.1.2 NOKEY
       allow-notify: 1.1.1.3 NOKEY
       request-xfr: 1.1.1.3 NOKEY
       allow-notify: 1.1.1.4 NOKEY
       request-xfr: 1.1.1.4 NOKEY

[... lot's of zones ...]

Increasing xfrd-reload-timeout: 10 does NOT fix it, xfrd still enters the loop
mentioned above.

If we postpone the initial zone writes with high zonefiles-write: 86700 value 
we do not hit the bug loop above until first write, as zone->soa_disk_acquired
seems 0 and conditions are not met in xfrd_check_failed_updates().

So this seems to be related to the xfrd->reload_cmd_last_sent timestamp.

Sidenote: These timestamps seem to be gathered via time(), note: time() values
are subject to changes/adjustments/setting back by e.g. ntpd so it would be 
really better to use clock_gettime(...MONOTONIC), if these timestamps are 
needed to be reliable increasing and avoiding races.

Sidenote: The same bug seemed to be reported already back in 2008 
on the mailing-lists:
https://www.nlnetlabs.nl/pipermail/nsd-users/2008-July/000799.html
https://open.nlnetlabs.nl/pipermail/nsd-users/2008-September/000820.html
Comment 1 Wouter Wijngaards 2016-08-10 13:51:16 CEST
Hi umaxx,

The error is something to look at.

The base cause is that the master servers do not properly send zone transfers.  The IXFRs do not actually match (old vs new).  And, from what you say force-transfer fixes it, the soa serial number is also not reliably kept track of.  So I would recommend trying to get the master servers fixed (if you can).

Perhaps a config where you disable IXFR for the zone?  This may workaround the issue with the master's notion of changes.
change the config lines to this:
request-xfr: AXFR 1.1.1.1 NOKEY

Best regards, Wouter
Comment 2 Wouter Wijngaards 2016-08-10 13:53:19 CEST
Hi,

You mentioned the older reports, we fixed that by checking if the IXFR has problems, and if so, reverting to an AXFR.  But this fix alone seems insufficient for you; and additionally there is another issue with the timestamps it looks like.

Best regards, Wouter
Comment 3 mail 2016-08-10 15:07:17 CEST
(In reply to Wouter Wijngaards from comment #1)
> The error is something to look at.
> 
> The base cause is that the master servers do not properly send zone
> transfers.  The IXFRs do not actually match (old vs new).  And, from what
> you say force-transfer fixes it, the soa serial number is also not reliably
> kept track of.  

Yes, immediately before the "reload lost" message there is the following in
the logs:

Aug 10 12:38:28 myhostname nsd[20243]: diff: domain cetqe-patl7.myzone.tld. does not exist
Aug 10 12:38:28 myhostname nsd[20243]: Failed to apply IXFR cleanly (deletes nonexistent RRs, adds existing RRs). Zone myzone.tld. contents is different from master, starting AXFR. Transfer received update to serial 9298441 at 2016-08-10T12:38:19 from 1.1.1.2
Aug 10 12:38:28 myhostname nsd[20243]: zone myzone.tld. received update to serial 9298441 at 2016-08-10T12:38:19 from 1.1.1.2 of 300 bytes in 0.000107 seconds

...and then the:

Aug 10 12:38:28 myhostname nsd[14088]: xfrd: zone magna.global: needs to be loaded. reload lost? try again

...followed by the "update failed, restarting transfer" loop.

> So I would recommend trying to get the master servers fixed
> (if you can).

Not really possible.  Also I do not think they are "broken".
These are "standard" Active Directory Domain Controllers (a lot of them),
which act master-master sending a lot of notifies. 
Also, BIND can handle them fine (we are in the middle of the process
of migrating from BIND to NSD/Unbound).

Sidenote: the myzone.tld is rather large 1.5MB with more than 45k lines on
disk. Maybe this is related? Some buffer filled up? Also broken network
is possible, sometimes loosing some packages somewhere.
 
> Perhaps a config where you disable IXFR for the zone?  This may workaround
> the issue with the master's notion of changes.
> change the config lines to this:
> request-xfr: AXFR 1.1.1.1 NOKEY

That would work but would transfer the whole 1.5MB every some seconds :(
The IXFR works mostly. It just (reliable) fails after runnnig several thousands
IXFRs for about an hour.
Comment 4 mail 2016-08-10 15:21:12 CEST
(In reply to Wouter Wijngaards from comment #2)

> You mentioned the older reports, we fixed that by checking if the IXFR has
> problems, and if so, reverting to an AXFR.  But this fix alone seems
> insufficient for you; 

Yes and no :)

I really think NSD/xfrd should recover from a broken IXFR correctly instead
of going into a logging loop.

Main question for me is: Why does the "reload lost" happen at all, 
and why can it not correctly recover/commit-and-serve afterwards and 
end up in the loop, instead of just doing AXFR correctly?

Within the mentioned "update failed, restarting transfer" loop I see the 
following interesting behaviour:

$ nsd-control zonestatus myzone.tld
zone:   myzone.tld
        state: ok
        served-serial: "9298439 since 2016-08-10T12:38:28"
        commit-serial: "9299506 since 2016-08-10T14:37:00"

So, it has the 9299506 correctly transfered, but then fails to serve it?

In the logs I have:

Aug 10 14:37:00 myhostname nsd[14088]: xfrd: zone myzone.tld committed "received update to serial 9299506 at 2016-08-10T14:37:00 from 1.1.1.4"
Aug 10 14:37:04 myhostname nsd[886]: notify for myzone.tld. from 1.1.1.4 serial 9299507
Aug 10 14:37:04 myhostname nsd[14088]: xfrd: zone myzone.tld committed "received update to serial 9299507 at 2016-08-10T14:37:04 from 1.1.1.4"
Aug 10 14:37:04 myhostname nsd[14088]: xfrd: zone myzone.tld: soa serial 9299507 update failed, restarting transfer (notified zone)

Then I see:

$ nsd-control zonestatus myzone.tld 
zone:   myzone.tld
        state: refreshing
        served-serial: "9298439 since 2016-08-10T12:38:28"
        commit-serial: "9298439 since 2016-08-10T12:38:28"

... so back from 929950x down to 9298439 while refreshing/restarting? :(

and then the loop starts over:

$ nsd-control zonestatus myzone.tld 
zone:   myzone.tld
        state: ok
        served-serial: "9298439 since 2016-08-10T12:38:28"
        commit-serial: "9299509 since 2016-08-10T14:37:31"

> and additionally there is another issue with the
> timestamps it looks like.

Yes, it seems to be.
Comment 5 mail 2016-08-10 15:23:17 CEST
I forgot to mention: since we can reproduce this easily I'm happy
to do any further test. 

Let me know how we can track this down further?
Adding some debug printf's somewhere should be no problem!

Thanks!
Comment 6 mail 2016-08-10 15:35:03 CEST
Some more information, the underlying OpenBSD runs with the default
configuration. So no network or socket buffer tweaks applied.
From login.conf default daemon class is used. 

Lot's of throughput on the system.

So, likely we hit some OS limit here, resulting in the initially 
failing IXFRs. But I have no idea which ones, and how to track down further.
Comment 7 mail 2016-08-10 15:46:23 CEST
Just to be clear: this does not happen after *each* failing IXFR.

Mostly IXFR just works.
Sometimes IXFR fails, but succesful fallback to AXFR, everything fine.

Rarely (after about 1h in my case), IXFR fails, fallback to
AXFR happens, followed by the failing loop.
Comment 8 Wouter Wijngaards 2016-08-10 15:49:09 CEST
Hi umaxx,

Is this failing AXFR always one that is very small, like the 300 byte one you logged?  The working ones would then be 45kb or so?

(45kb is not a problem, by the way; NSD is also used for zones in the Gbs of size).

Best regards, Wouter
Comment 9 mail 2016-08-10 16:53:02 CEST
(In reply to Wouter Wijngaards from comment #8)

> Is this failing AXFR always one that is very small, like the 300 byte one
> you logged? The working ones would then be 45kb or so?

Nope. A quick grep reveals that it fails on various sizes, I found entries in the last hours between 2xx bytes and 8k.
Comment 10 Wouter Wijngaards 2016-08-12 11:49:35 CEST
I have sent a patch to you; this should break up the loop.  (But does not handle the timing issue you noted, or the failed download itself).  When the loop starts and it looks like this patch wants, it'll start a force_transfer for you.  That hopefully should make it get out of the mess by itself.

Best regards, Wouter
Comment 11 mail 2016-08-16 19:19:34 CEST
(In reply to Wouter Wijngaards from comment #10)
> I have sent a patch to you; this should break up the loop.  (But does not
> handle the timing issue you noted, or the failed download itself).  When the
> loop starts and it looks like this patch wants, it'll start a force_transfer
> for you.  That hopefully should make it get out of the mess by itself.

As replied in private, the patch seemed not to solve the loop.

Please find below a proposed suggestion for a better time function
returning/using double timestamps and monotonic time agnostic to 
ntpd changes (works on BSD/OSX/LINUX).

#include <time.h>
 
#ifdef __MACH__
#include <mach/mach_time.h>
#endif

#define XFRD_NANO (1e-9)

static double xfrd_get_time(void) {
#ifdef __MACH__
    static mach_timebase_info_data_t tb;

    if (!tb.denom && mach_timebase_info(&tb) == -1)
        err(1, "mach_timebase_info failed");
    return mach_absolute_time() * tb.numer / tb.denom * C_NANO;
#else
    struct timespec ts = { 0, 0 };

    if (clock_gettime(CLOCK_MONOTONIC, &ts) == -1)
        err(1, "clock_gettime failed");
    return ts.tv_sec + ts.tv_nsec * C_NANO;
#endif
}