Bug 804

Summary: Unbound fails to resolve after connectivity loss.
Product: unbound Reporter: martian67
Component: server Assignee: unbound team <unbound-team>
Status: RESOLVED FIXED    
Severity: normal CC: cathya, daniel, drahnier, martian67, wouter
Priority: P5    
Version: 1.5.9   
Hardware: x86_64   
OS: Windows   

Description martian67 2016-07-25 13:33:31 CEST
If there is a connectivity failure that causes timeouts or ICMP no route messages to the default gateway, the unbound daemon will permanently lose the ability to resolve hosts until it is restarted. The behaviour is not consistent; it seems to happen more often when connectivity is rapidly flapping between timing out and being available.
Comment 1 martian67 2016-07-25 13:34:59 CEST
To clarify, the loss of the ability to resolve continues _after_ connectivity is restored and working again.
Comment 2 Wouter Wijngaards 2016-07-25 13:53:24 CEST
Hi Martian67,

Yes, unbound waits 15 minutes to spare the other servers traffic.  The documentation here describes how this happens:

http://unbound.net/documentation/info_timeout.html

If this happens a lot for you, change the infra-host-ttl timeout from 15 minutes to something (a lot) lower, like 60 seconds.
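
For reference, a minimal sketch of that change in unbound.conf. It assumes the option meant here is infra-host-ttl, whose default of 900 seconds matches the 15 minutes mentioned above; the value 60 is only the example figure from this comment, not a recommendation.

```
server:
    # Assumed option: infra-host-ttl (default 900 seconds = 15 minutes).
    # How long unbound remembers a server's timeout/down status.
    infra-host-ttl: 60
```

The daemon needs to reload its configuration before the new value takes effect.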

Best regards, Wouter
Comment 3 martian67 2016-07-25 14:07:06 CEST
DNS resolving will continue to be broken far beyond 15 minutes; it will fail to work indefinitely until the daemon is restarted. Given the transient nature of this issue, it's hard to replicate on demand. Do you have any debugging commands I should run when it occurs again?
Comment 4 Bink 2016-08-15 23:15:15 CEST
I can confirm this behavior--increasing or decreasing infra-host-ttl has no positive effect.  Some more detail may be found at https://www.reddit.com/r/openbsd/comments/4jw4sx/the_day_some_of_the_dns_stopped.
Comment 5 drahnier 2016-08-23 02:02:37 CEST
I think I'm frequently running into this problem too, and would like to know whether it is going to be 'fixed' in the next build, and whether a pre-release version of such a build could be made available (for my platform) for testing.
Comment 6 Wouter Wijngaards 2016-08-23 09:35:20 CEST
Hi,

Could the exponential backoff and rapid-flapping result in marking all servers as permanently down?  And thus have (hours-long) timeouts?

How could unbound detect this rapid-flapping or connectivity loss (or connectivity resumption)?  Right now detection is on a server-by-server basis.

Best regards, Wouter
Comment 7 Wouter Wijngaards 2016-08-23 10:33:05 CEST
Hi,

There is a fix in the code repository that should alleviate some of the issue, but it has not solved the underlying cause.  It'll remove the 'waiting for empty list' entries from the requestlist.  That should make things work again after an outage.

Queries shouldn't get into that state in the first place, but now the error is logged when it happens, rather than only once unbound has ground to a halt.  Logs of how it happens (with high verbosity) may help.

Best regards, Wouter
Comment 8 Wouter Wijngaards 2016-08-23 10:51:52 CEST
Hi,

Fixed the state machines getting stuck in 'waiting for empty list'.  The failures caused by the outage left the num_target_queries counter not properly reset, which kept the other queries stuck in the waiting state.  This should resolve the issue, I think.
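
As an illustration only, the failure mode described here can be modelled with a small Python sketch. This is not Unbound's source code: the class ParentQuery, its methods, and the buggy flag are hypothetical names invented for the example, and only the counter name num_target_queries comes from this comment. The sketch shows a parent query that waits until its outstanding target lookups reach zero, and how a failure path that forgets to account for a failed target leaves the parent waiting forever.

```python
from dataclasses import dataclass


@dataclass
class ParentQuery:
    """Hypothetical parent query waiting on target (nameserver) lookups."""
    num_target_queries: int = 0
    state: str = "idle"

    def start_targets(self, n: int) -> None:
        # Launch n target lookups and wait for all of them to finish.
        self.num_target_queries += n
        self.state = "waiting"

    def target_done(self, ok: bool, buggy: bool = False) -> None:
        # In the buggy variant, a failed target is never counted as finished,
        # so the counter cannot return to zero.
        if ok or not buggy:
            self.num_target_queries -= 1
        if self.num_target_queries == 0:
            self.state = "ready"


def run(outcomes, buggy):
    """Feed a list of target outcomes (True = answered, False = failed)."""
    q = ParentQuery()
    q.start_targets(len(outcomes))
    for ok in outcomes:
        q.target_done(ok, buggy=buggy)
    return q.state, q.num_target_queries


if __name__ == "__main__":
    outage = [True, False, False]  # two targets fail during the outage
    print("buggy:", run(outage, buggy=True))
    print("fixed:", run(outage, buggy=False))
```

In this model the buggy variant ends in the waiting state with a non-zero counter even after connectivity returns, which is consistent with the symptom reported in this bug; the fixed variant counts failures as completed and proceeds.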

Best regards, Wouter
Comment 9 drahnier 2016-08-24 01:05:52 CEST
(In reply to Wouter Wijngaards from comment #8)
> Hi,
> 
> Fixed the state machines getting stuck in waiting for empty_list.  It was
> the failures caused by the outage causing the counter for num_target_queries
> to be not reset properly, causing the other queries to remain stuck in the
> waiting state.  This should resolve the issue, I think.
> 
> Best regards, Wouter

So, for heaven's sake (or whatever you value), PLEASE make available a compiled executable (for Windows; btw, why is there no 64bit version?) asap.
Comment 10 Wouter Wijngaards 2016-08-25 09:37:27 CEST
Hi,

32bit versions here:
http://www.nlnetlabs.nl/~wouter/unbound-1.5.10_20160825.zip 
and unbound_setup_1.5.10_20160825.exe

At the same place I have also put 64bit compiled versions (untested, but I set the compiler flag to 64bit):
unbound-1.5.10rc7.zip unbound_setup_1.5.10rc7.exe

Best regards, Wouter
Comment 11 drahnier 2016-08-25 10:05:00 CEST
(In reply to Wouter Wijngaards from comment #10)
> Hi,
> 
> 32bit versions here:
> http://www.nlnetlabs.nl/~wouter/unbound-1.5.10_20160825.zip 
> and unbound_setup_1.5.10_20160825.exe
> 
> At the same place I have also put 64bit compiled versions (untested, but I
> set the compiler flag to 64bit):
> unbound-1.5.10rc7.zip unbound_setup_1.5.10rc7.exe
> 
> Best regards, Wouter

32bit version installed and so far running without any problems on one machine.

64bit version reports "fatal error: could not read config file". It appears that the 64bit version expects its config file to be found in 'Program Files (x86)\Unbound'. That's funny, because the 32bit version, along with its config file, is quite happy living under 'Program Files\Unbound' ...
Comment 12 drahnier 2016-08-25 10:23:58 CEST
(Somehow I can't edit my recent post:) I double-checked registry entries for both the 32bit and 64bit versions. Both have 'ImagePath' as 'REG_EXPAND_SZ' '"C:\Program Files\Unbound.exe" -c "C:\Program Files\Unbound\service.conf" -w service',  but the 64bit version somehow ignores the path specified for 'service.conf'.
Comment 13 drahnier 2016-08-25 10:48:11 CEST
More observations about the 64bit Windows version:
(1) placing a copy of 'service.conf' in 'C:\Program Files (x86)\Unbound' is sufficient to get the 64bit version running, even though it is installed under 'Program Files\Unbound'.
(2) while the 32bit version of Unbound feels content with occupying around 1.4MB of memory under low load, the 64bit version, under comparable load,  consumes a whopping 65.4MB! That's about 50 times as much and definitely does not look right.
Comment 14 drahnier 2016-08-25 12:31:21 CEST
More on the 64bit Windows version:
Memory footprint readjusted itself from the initially reported 65.4MB to a perfectly acceptable 3.7MB after one hour. So, except for it expecting to find 'service.conf' under 'Program Files (x86)\Unbound' [that should be easy to fix], the 64bit Windows version is doing fine now.
Comment 15 drahnier 2016-08-29 01:58:14 CEST
(In reply to drahnier from comment #14)
> More on the 64bit Windows version:
> Memory footprint readjusted itself from the initially reported 65.4MB to
> perfectly acceptable 3.7MB after one hour. So, except for it expecting to
> find 'service.conf' under 'Program Files (x86)\Unbound' [that should be easy
> to  fix], the 64bit Windows version is doing fine now.

Update: The 'service.conf' issue appears to be with 'unbound-checkconf', and not with 'unbound' itself. 'unbound-checkconf' requires a full path name as argument and, if that is not given, uses 'C:\Program Files (x86)\Unbound' as prefix for 'service.conf', completely disregarding the fact that the install directory is 'Program Files\Unbound' and that it is the 64bit version of 'unbound' which is running. IMHO, 'unbound-checkconf' should look for 'service.conf' in the directory where 'unbound' is installed first, before applying any defaults ...

Btw: The 64bit version is doing just fine.
Comment 16 martian67 2016-09-06 16:06:00 CEST
Haven't seen the issue manifest itself with this fix; seems like it's been fixed.
Comment 17 Wouter Wijngaards 2016-09-06 16:07:52 CEST
Hi,

The original bug has been fixed! Thanks for the debug information.

Best regards, Wouter