Bug 1393 - Problem with CNAME resolving in unbound 1.6.4
Problem with CNAME resolving in unbound 1.6.4
Status: ASSIGNED
Product: unbound
Classification: Unclassified
Component: server
1.6.4
x86_64 Linux
: P5 major
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2017-08-02 11:23 CEST by Igor Novgorodov
Modified: 2017-08-03 09:46 CEST (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Igor Novgorodov 2017-08-02 11:23:49 CEST
We're using unbound as a recursive resolver for a large number of clients.
After a recent upgrade to 1.6.4 from 1.5.8 a weird problem appeared.

There's a fancy domain mail.google.com that is actually a CNAME to googlemail.l.google.com.

Usually it works fine and unbound resolves it correctly and returns both CNAME and corresponding A record that this CNAME points, like this:
;; ANSWER SECTION:
mail.google.com.        589919  IN      CNAME   googlemail.l.google.com.
googlemail.l.google.com. 104    IN      A       173.194.222.18
googlemail.l.google.com. 104    IN      A       173.194.222.19
googlemail.l.google.com. 104    IN      A       173.194.222.83
googlemail.l.google.com. 104    IN      A       173.194.222.17

But after some time it breaks and starts to return *only* the CNAME, without A record:
;; ANSWER SECTION:
mail.google.com.        573557  IN      CNAME   googlemail.l.google.com.

And consequently the clients fail to resolve the domain.

However, the googlemail.l.google.com domain is in unbound's cache and resolves fine:
;; ANSWER SECTION:
googlemail.l.google.com. 36     IN      A       64.233.162.19
googlemail.l.google.com. 36     IN      A       64.233.162.83
googlemail.l.google.com. 36     IN      A       64.233.162.17
googlemail.l.google.com. 36     IN      A       64.233.162.18

When I flush the mail.google.com from the unbound's cache with 'unbound-control flush_zone mail.google.com' it starts to resolve correctly again until the next time.

We've got another similar instance with an older version of unbound where this problem does not manifest itself.

Thanks for any ideas.
Maybe this is related to some configuration parameters like hardening?
Comment 1 Igor Novgorodov 2017-08-02 11:27:01 CEST
Note: it's not the only problematic domain. We have problems with other differenct CNAMEs too.
Comment 2 Wouter Wijngaards 2017-08-02 12:01:10 CEST
Hi Igor,

Do you have a local-data entry with a CNAME?  For that domain?  Because between those unbound versions the commit 'allow local-data CNAMEs to point to internet names' happened, and I guess this could be where the problem is.

Does it happen every time the A record times out?

Best regards, Wouter
Comment 3 Igor Novgorodov 2017-08-02 12:29:43 CEST
Hi Wouter, thanks for the response.

We have some Google's subdomains in local-data (we're overriding the captive portals in mobile devices):

# unbound-control list_local_zones | grep google.com
android.clients.google.com. static
clients3.google.com. static
clients4.google.com. static

But, as you can see, mail.google.com isn't one of them - I've checked this in the first place. And the other problematic domain that I'm aware of (cdn.echo.msk.ru) isn't in our local-data at all levels.

> Does it happen every time the A record times out?
Hard to say, it looks like a more or less random issue, but TTL is surely involved somehow.

For example, I've made these requests immediately after one another (876 -> 875 CNAME TTL):

;; ANSWER SECTION:
cdn.echo.msk.ru.        876     IN      CNAME   echomsk.cdnvideo.ru.
echomsk.cdnvideo.ru.    21      IN      A       151.236.85.10

;; ANSWER SECTION:
cdn.echo.msk.ru.        875     IN      CNAME   echomsk.cdnvideo.ru.

So there were 21 seconds of TTL and then it's immediately broken.

The same with google:
;; ANSWER SECTION:
mail.google.com.        604153  IN      CNAME   googlemail.l.google.com.
googlemail.l.google.com. 176    IN      A       173.194.73.17
googlemail.l.google.com. 176    IN      A       173.194.73.83
googlemail.l.google.com. 176    IN      A       173.194.73.18
googlemail.l.google.com. 176    IN      A       173.194.73.19

;; ANSWER SECTION:
mail.google.com.        604132  IN      CNAME   googlemail.l.google.com.

There was a lot of TTL remaining and it still broke.

Sometimes the TTL of A-records is refreshed somehow and it works fine for a prolonged period of time (look at the CNAME TTL - it's decreased, but A TTL increased):

;; ANSWER SECTION:
mail.google.com.        604744  IN      CNAME   googlemail.l.google.com.
googlemail.l.google.com. 219    IN      A       64.233.162.17
googlemail.l.google.com. 219    IN      A       64.233.162.18
googlemail.l.google.com. 219    IN      A       64.233.162.19
googlemail.l.google.com. 219    IN      A       64.233.162.83

mail.google.com.        604696  IN      CNAME   googlemail.l.google.com.
googlemail.l.google.com. 267    IN      A       173.194.221.19
googlemail.l.google.com. 267    IN      A       173.194.221.17
googlemail.l.google.com. 267    IN      A       173.194.221.18
googlemail.l.google.com. 267    IN      A       173.194.221.83
Comment 4 Igor Novgorodov 2017-08-02 12:35:09 CEST
> And the other problematic domain that I'm aware of (cdn.echo.msk.ru) isn't in our local-data at all levels.

Correction - actually we have some .msk.ru subdomains if that's relevant:
# unbound-control list_local_zones | grep '\.msk.ru'
fonbet.msk.ru. static
www.vipdosug.msk.ru. static
bukmekerskiekontory.msk.ru. static
Comment 5 Wouter Wijngaards 2017-08-02 14:13:22 CEST
Hi Igor,

Did you compile with ./configure --without-threads (or something like that that disables threading, but then you enabled num-threads > 1 in config, that results in multiple caches in parallel, that could in fact be out of sync, i.e. one contains CNAME, other CNAME+A ? )

When I run this in tryout, I cannot reproduce the issue, it works perfectly.  So I am looking for other oddities.

Best regards, Wouter
Comment 6 Igor Novgorodov 2017-08-02 14:45:08 CEST
No, it's built with threads and libevent, here's the snippet from the .spec file that we're using to build unbound:

%configure  --with-libevent \
            --with-pthreads \
            --disable-rpath \
            --disable-static \
            --disable-sha2 \
            --disable-gost \
            --disable-ecdsa \
            --with-conf-file=%{_sysconfdir}/%{name_orig}/unbound.conf \
            --with-pidfile=%{_localstatedir}/run/%{name_orig}/%{name_orig}.pid \
            --with-rootkey-file=%{_sharedstatedir}/unbound/root.key \
%if %{with_python}
            --with-pythonmodule \
            --with-pyunbound
%endif

And ldd shows that libpthread is linked:

# ldd /usr/sbin/unbound | grep thread
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fb00e416000)

We're using 4 threads.
Here's the unbound.conf just in case: https://novg.net/pastebin/view/c7c8ce96
Comment 7 Wouter Wijngaards 2017-08-02 14:51:51 CEST
Hi Igor,

Now, this is unlikely, but it could be that you have a private-address for the new address of the CNAME?

    private-address: 10.0.0.0/8
    private-address: 172.16.0.0/12
    private-address: 192.168.0.0/16
    private-address: 169.254.0.0/16

But it does not look like that, and google probably, doesn't, but that would result in a CNAME with no A records once the private A records have been removed.

How easy is it to reproduce, can you reproduce and get verbosity 4 log of when it goes wrong?

Best regards, Wouter
Comment 8 Wouter Wijngaards 2017-08-02 14:53:05 CEST
Hi Igor,

I forgot, but you can reset verbosity at runtime with unbound-control verbosity 4 and then verbosity 1 to set it back.

Best regards, Wouter
Comment 9 Wouter Wijngaards 2017-08-02 15:01:27 CEST
Hi Igor,

I found something suspicious, it looks like you have qname-minimisation enabled, and google's CNAME has a weird intermediary zone.

Unbound does not take the A record that is offered with the CNAME, but resolve it itself (to stop cache poisoning), but that means looking up googlemail.l.google.com.  And then it gets for the l.google.com A query (because of qname minimisation):
l.google.com.   60      IN      SOA     ns1.google.com. dns-admin.google.com. 163940845 900 900 1800 60

So it looks like there is some sort of intermediary zone here.  That we did not get a referral for, it is co-hosted parent and child zone.  This is often a case that gives corner-cases issues in some way.

Is it that qname-minimisation is causing a re-lookup (somehow?) to fail?  But the cache entries look fine and haven't timed out...

Best regards, Wouter
Comment 10 Wouter Wijngaards 2017-08-02 15:06:27 CEST
Hi Igor,

Could you try if it still happens if you disable qname-minimisation.  When I try with qname-minimisation it works fine.  But perhaps not for you; if so, and qname minimisation was coded between 1.5.8 - 1.6.4; so that is something to try.

qname minimisation might have made the query return CNAME and NXDOMAIN, but you did not copy that line.  When the A record is missing, is it NXDOMAIN (that might be a telltale sign that qname minimisation has failed, and specifically, got NXDOMAIN for l.google.com or something like that).

Best regards, Wouter
Comment 11 Igor Novgorodov 2017-08-02 15:19:27 CEST
When the replies are already broken - I'm not getting any traces of the request in the log even on verbosity level 4 (maybe only the one-liner request like 'info: 10.34.64.16 cdn.echo.msk.ru. A IN', but this one if filtered by our syslog-ng).

But when I flush the problematic domain from the cache, then I see the debug log on the domain in question.

Here it is: https://novg.net/pastebin/view/19d4e8d6

This instance of Unbound is getting around 10k requests per second, so it's kinda hard to do the debugging.

> Could you try if it still happens if you disable qname-minimisation
I've disabled qname-minimisation using unbound-control for now, will keep an eye on the  system. Currently it seems to work fine. Will keep you posted.

But even if it helps, then something changed\was broken between these versions...
Comment 12 Igor Novgorodov 2017-08-03 09:46:16 CEST
After a night with a disabled there were no issues detected, all CNAMEs are resolving fine. So I think the immediate issue is solved, thank you very much for the advice. But questions remain :)

> When the A record is missing, is it NXDOMAIN (that might be a telltale sign that qname minimisation has failed, and specifically, got NXDOMAIN for l.google.com or something like that).

No, when the A record is missing it still returns NOERROR.
I've just enabled qname-minimisation and the problem immediately manifested itself:

# dig cdn.echo.msk.ru @10.1.1.10

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.4 <<>> cdn.echo.msk.ru @10.1.1.10
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37865
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;cdn.echo.msk.ru.               IN      A

;; ANSWER SECTION:
cdn.echo.msk.ru.        359     IN      CNAME   echomsk.cdnvideo.ru.

;; Query time: 0 msec
;; SERVER: 10.1.1.10#53(10.1.1.10)
;; WHEN: Thu Aug  3 10:38:28 2017
;; MSG SIZE  rcvd: 64