Bug 415 - unbound fails to resolve A record of edadfs.partners.extranet.microsoft.com
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
Version: unspecified
Hardware: Other Linux
Importance: P5 enhancement
Assigned To: unbound team
Reported: 2011-10-22 17:38 CEST by Piotr Majka
Modified: 2011-10-24 15:36 CEST

Description Piotr Majka 2011-10-22 17:38:05 CEST
I have a problem getting a valid response for at least one record:
[root@gts-new ~]# host edadfs.partners.extranet.microsoft.com 10.0.0.1
;; connection timed out; no servers could be reached
[root@gts-new ~]#
Unbound works fine for other names, of course:
[root@gts-new ~]# host www.nlnetlabs.nl 10.0.0.1
Using domain server:
Name: 10.0.0.1
Address: 10.0.0.1#53
Aliases:

www.nlnetlabs.nl has address 213.154.224.1
www.nlnetlabs.nl has IPv6 address 2001:7b8:206:1::1
[root@gts-new ~]#


I have no problem when the query goes to a BIND DNS server (9.8.1):

[root@gts-new ~]# host edadfs.partners.extranet.microsoft.com 127.0.0.1
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases:

edadfs.partners.extranet.microsoft.com has address 65.55.28.131
[root@gts-new ~]#

I tried pdns-recursor too, and it resolves the name correctly.
Info about unbound:
[root@gts-new ~]# /usr/sbin/unbound -h
usage:  unbound [options]
        start unbound daemon DNS resolver.
-h      this help
-c file config file to read instead of /etc/unbound/unbound.conf
        file format is described in unbound.conf(5).
-d      do not fork into the background.
-v      verbose (more times to increase verbosity)
Version 1.4.13
linked libs: libevent 1.4.13-stable (it uses epoll), ldns 1.6.8, OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008
linked modules: validator iterator
configured for x86_64-redhat-linux-gnu on Tue Oct 11 14:33:36 CEST 2011 with options: '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--target=x86_64-redhat-linux-gnu' '--program-prefix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--localstatedir=/var' '--sharedstatedir=/usr/com' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--with-ldns=' '--with-libevent' '--with-pthreads' '--with-ssl' '--disable-rpath' '--enable-debug' '--disable-static' '--with-conf-file=/etc/unbound/unbound.conf' '--with-pidfile=/var/run/unbound/unbound.pid' '--enable-sha2' '--disable-gost'
BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl
[root@gts-new ~]#

This is running on RHEL 5.7. Version 1.4.12 has the same problem: no response. Unbound 1.4.13 installed from the rawhide repository on Fedora 17 (a rawhide installation) fails in the same way with the default configuration; I did not change anything in unbound.conf.
Comment 1 Wouter Wijngaards 2011-10-24 10:09:03 CEST
Hi Piotr,

Okay, so it tries to resolve the name and reaches the delegation for partners.extranet.microsoft.com:
partners.extranet.microsoft.com.        3600    IN      NS      dns13.one.microsoft.com.
partners.extranet.microsoft.com.        3600    IN      NS      dns11.one.microsoft.com.
partners.extranet.microsoft.com.        3600    IN      NS      dns12.one.microsoft.com.
partners.extranet.microsoft.com.        3600    IN      NS      dns10.one.microsoft.com.

;; ADDITIONAL SECTION:
dns13.one.microsoft.com.        3600    IN      A       65.55.31.17
dns11.one.microsoft.com.        3600    IN      A       94.245.124.49
dns12.one.microsoft.com.        3600    IN      A       207.46.55.10
dns10.one.microsoft.com.        3600    IN      A       131.107.125.65

Now,
94.245.124.49 is recursive: not an authoritative server but a cache.
65.55.31.17 is recursive: not an authoritative server but a cache.
207.46.55.10 is lame: it does not serve partners.extranet.microsoft.com.
131.107.125.65 is down: it times out and does not answer.
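
As an illustration of the four categories above, here is a minimal Python sketch. It is a simplified heuristic based on the AA and RA header flags, not unbound's actual lameness detection. The Reply class, the classify function, and the flag values in the table are assumptions made up to match the description above; they were not captured from the wire.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reply:
    """Hypothetical summary of a reply to a non-recursive query."""
    aa: bool  # AA (Authoritative Answer) header flag
    ra: bool  # RA (Recursion Available) header flag


def classify(reply: Optional[Reply]) -> str:
    """Classify a nameserver from its reply; None means the query timed out."""
    if reply is None:
        return "down"
    if reply.aa:
        return "authoritative"
    if reply.ra:
        return "recursive"  # answers from a cache, not from zone data
    return "lame"  # responds, but does not serve the zone


# Assumed flag values, chosen to reproduce the classification in this comment.
observed = {
    "94.245.124.49": Reply(aa=False, ra=True),
    "65.55.31.17": Reply(aa=False, ra=True),
    "207.46.55.10": Reply(aa=False, ra=False),
    "131.107.125.65": None,
}

for address, reply in observed.items():
    print(address, "is", classify(reply))
```

None of the four addresses comes out as "authoritative", which is why the resolver is left with only bad choices.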

The first pass does not result in a server that unbound wants to talk to (it tries 131.107.125.65 a couple of times, but it keeps timing out). So it attempts to fetch more choices by searching for other nameserver addresses for this domain, but there is no IPv6 deployed or anything like that.

So it ends up choosing from a list of bad choices. It sends the query to one of the recursive servers (it may be cache-poisoned, but it's the only way right now). The response is then used as the answer for the A record. It works.

When the next query comes along, however, it wants to avoid the server that timed out. The server selection has a bug here: it picks this timed-out server and rejects it, but then rejects the entire query instead of going on to the other choices.
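
The difference between rejecting one server and rejecting the whole query can be sketched in a few lines of Python. This is only a model of the behaviour described in this comment, not unbound's real selection code. The names Candidate, select_buggy, and select_fixed are hypothetical, and the candidate order (timed-out server first, because it is the preferred one) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    address: str
    timed_out: bool  # hypothetical flag: the server is known to time out


def select_buggy(candidates: List[Candidate]) -> Optional[str]:
    """Models the bug: meeting a timed-out server aborts the whole query."""
    for candidate in candidates:
        if candidate.timed_out:
            return None  # the entire query is rejected
        return candidate.address
    return None


def select_fixed(candidates: List[Candidate]) -> Optional[str]:
    """Models the intended behaviour: skip the timed-out server and move on."""
    for candidate in candidates:
        if candidate.timed_out:
            continue  # reject only this server
        return candidate.address
    return None


candidates = [
    Candidate("131.107.125.65", timed_out=True),
    Candidate("94.245.124.49", timed_out=False),
    Candidate("65.55.31.17", timed_out=False),
]

print("buggy:", select_buggy(candidates))
print("fixed:", select_fixed(candidates))
```

In this model the buggy variant returns nothing at all, which corresponds to the client seeing a timeout, while the other variant falls through to the next usable address.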

Thanks for the report, this particular misconfiguration triggers an interesting codepath, and this is a bug in the server selection.

One workaround in unbound.conf is
do-not-query-address: 131.107.125.65
That will blacklist that address, which is (probably) harmless, and it helps server selection work around this bug by asking the other servers. (It is not harmless if that server is the only one serving some other zones for which it does not time out.)

If you can influence the authority servers here, please tell the DNS operators to fix them so that they deploy authority servers, not caches. That would also work around this bug. In other words, they are now running 'unbound' (or something like it) on those servers, but they should be running 'nsd' (or something like it). In PowerDNS terms, that is pdns-recursor where it should be pdns; for BIND it is a configuration problem (the zone needs to be configured as a slave or master zone).

Of course it should be fixed in the code, but the above workarounds may help you in the short term.

Best regards,
   Wouter
Comment 2 Wouter Wijngaards 2011-10-24 10:40:47 CEST
Hi Piotr,

The bug has been fixed in svn trunk r2522 of unbound (this is the next release under development). It works for me now. It first spends time on the unresponsive server to see if it responds (that server is preferred, if only it answered), which may take 10-20 seconds, but once it has determined that the server is down, resolution works for the misconfigured domain.
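
To see where a delay of that order can come from, here is a small Python model of "probe the preferred server first, then fall back". It does not reflect unbound's real retry counts or timeouts: MAX_PROBES and PROBE_TIMEOUT_S are hypothetical values picked only so that the total lands in the 10-20 second range mentioned above, and resolve_with_fallback and fake_send are illustrative names.

```python
from typing import Callable, List, Optional, Tuple

MAX_PROBES = 3         # hypothetical; not unbound's actual limit
PROBE_TIMEOUT_S = 5.0  # hypothetical; not unbound's actual timeout


def resolve_with_fallback(
    preferred: str,
    fallbacks: List[str],
    send: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], float]:
    """Return (answer, seconds spent waiting on probes that timed out)."""
    waited = 0.0
    # Probe the preferred server until it answers or is considered down.
    for _ in range(MAX_PROBES):
        answer = send(preferred)
        if answer is not None:
            return answer, waited
        waited += PROBE_TIMEOUT_S
    # Preferred server is considered down: use the less trusted choices.
    for server in fallbacks:
        answer = send(server)
        if answer is not None:
            return answer, waited
    return None, waited


def fake_send(server: str) -> Optional[str]:
    """Stand-in for a network query; None represents a timeout."""
    if server == "131.107.125.65":
        return None
    return "65.55.28.131"


answer, waited = resolve_with_fallback(
    "131.107.125.65", ["94.245.124.49", "65.55.31.17"], fake_send
)
print(answer, "after waiting", waited, "seconds")
```

With these assumed numbers the first lookup waits through three timed-out probes before the fallback servers are asked.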

In general, the probes towards the timed-out server will slow down resolution of this domain. This is because we want security: the other options are (cache-poisoned?) caches, and the burden (slow resolution) is shouldered by someone who made a lot of mistakes configuring their DNS servers.

Best regards,
   Wouter
Comment 3 Piotr Majka 2011-10-24 14:53:49 CEST
Thank you for the response and the quick fix :) For now I am using another workaround:
local-data: "edadfs.partners.extranet.microsoft.com IN A 65.55.28.131"
I have no special contact at Microsoft to report the bad DNS configuration to. :)