Bug 669 - unbound does not return an A record from cache for request that originally resolved to CNAMEs
Status: ASSIGNED
Product: unbound
Classification: Unclassified
Component: server
Version: 1.4.22
Hardware: x86_64 Linux
Importance: P5 major
Assigned To: unbound team
Reported: 2015-04-23 19:36 CEST by Gunther
Modified: 2015-07-09 10:25 CEST (History)
Attachments
Transcript of the commands (6.64 KB, text/plain)
2015-04-30 14:33 CEST, Gunther
log of unbound (125.26 KB, application/x-zip-compressed)
2015-04-30 14:36 CEST, Gunther

Description Gunther 2015-04-23 19:36:18 CEST
I have a caching unbound instance from Ubuntu 14.04 LTS. Version 1.4.22. 
From time to time, I observe the following.

1. A Windows 7 machine queries for the A record of a name that happens to be a CNAME. Unbound responds with a reply that has both the CNAME and an A record for the CNAME's target.
2. After some time, for the same request, unbound returns only the CNAME in the reply.
3. Windows 7 does not go on to query the hostname that the CNAME points to and reports a resolution failure back to the calling software. It expects both the CNAME and an A record for the hostname the CNAME points to in the reply.

I do not know if returning only the CNAME is valid under the current RFCs, but adding a workaround for Windows clients to always return the A record along with CNAME would be a good thing.
Comment 1 Wouter Wijngaards 2015-04-24 08:59:57 CEST
Hi Gunther,

So Unbound has logic that should follow the CNAME to fetch the A record.  But if the A record has disappeared from the authoritative servers then I guess unbound would change the results too.

RFCs indicate unbound should return the CNAME and the A record.  And Unbound attempts to do so.

But why is it not happening for you?  Are you sure that the A record does not disappear from the authority servers (or those become unreachable?)?

Best regards, Wouter
Comment 2 Gunther 2015-04-24 16:46:48 CEST
Hello, Wouter.
I am sure that the A record does not disappear. When I try to `dig` the A record at the authoritative server from the host where unbound runs, it returns the A record just fine.

Let me know if you need any further debugging information. Packet captures, etc.
Comment 3 Wouter Wijngaards 2015-04-24 16:57:33 CEST
Hi Gunther,

Can you make logs with a high verbosity value (4).  That would need to capture unbound working on the query for the 'only CNAME but not A record' answer.  Not sure if your machine (depending on load) can run with such high verbose levels (volume to log becomes too large).

Can you show me the dig output of the CNAME+A when it is okay?  (+dnssec?)

One thing you could do is wait for it to fail, since you probably missed the exact failure moment (where high verbosity could perhaps tell me what is going on).  Then:
* Can you dig the name that failed (at unbound) and show me the output?  And then again with +cdflag +dnssec?
* Set verbosity higher (i.e. unbound-control verbosity 4) and remove the cname from cache, i.e. unbound-control flush_type name A and unbound-control flush_type name CNAME, then dig for the A record.  Then set verbosity back to 1.  The logfile (or syslog) should then have a lot of interesting information.
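
The steps above could be scripted roughly as follows. This is only a sketch: the helper names `diagnostic_commands` and `run_all`, the placeholder hostname, and the assumption that unbound listens on 127.0.0.1 are illustrative and not taken from this report. The first function only builds the command lines; actually running them needs `dig` and `unbound-control` on the host where unbound runs.

```python
import subprocess


def diagnostic_commands(name, resolver="127.0.0.1"):
    """Build the command sequence suggested above, in order.

    `name` is the failing hostname; `resolver` is assumed to be the
    address unbound listens on (a hypothetical default).
    """
    at = "@" + resolver
    return [
        # 1. Capture the failing answer as-is, then again with +cdflag +dnssec.
        ["dig", at, name, "A"],
        ["dig", at, name, "A", "+cdflag", "+dnssec"],
        # 2. Raise verbosity, drop the cached A and CNAME, and re-query.
        ["unbound-control", "verbosity", "4"],
        ["unbound-control", "flush_type", name, "A"],
        ["unbound-control", "flush_type", name, "CNAME"],
        ["dig", at, name, "A"],
        # 3. Set verbosity back to 1.
        ["unbound-control", "verbosity", "1"],
    ]


def run_all(commands):
    """Echo and run each command; raises on the first non-zero exit."""
    for argv in commands:
        print("$", " ".join(argv))
        subprocess.run(argv, check=True)
```

For example, `run_all(diagnostic_commands("www.example.com"))` would run the whole sequence, where `www.example.com` stands in for the name that failed.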

Is there anything that makes this CNAME different from other CNAMEs?  There are lots of CNAMEs that seem to work, but this one doesn't.  Or the A record.

At high verbosity, a logfile: "some filename" in unbound.conf can be nice.  Some syslog daemons do not like high throughput and start dropping lines, and that is not what we want for the brief moment we are interested in (although dropping lines is of course a good idea for keeping the system responsive).

Best regards,
   Wouter
Comment 4 Wouter Wijngaards 2015-04-24 17:02:40 CEST
Hi Gunther,

One option that occurs to me is that the server with the A record may not properly support EDNS or large fragments.  Your 'dig' from the commandline may not have +dnssec, and thus work.  But unbound tries to get DNSSEC with EDNS.  If that answer turns out to be dropped periodically, with some probability, unbound ends up having the A record in cache sometimes, but at other times cannot resolve the A record (while the CNAME is fine and always works).  One way this could happen is if the A record is signed, the response is very large (>512, >2K), and firewalls (or 'routers') interfere with these packets.

This is obviously only hypothetical, hence my questions about dig output and +dnssec dig outputs.

Best regards,
   Wouter
Comment 5 Gunther 2015-04-30 14:33:49 CEST
Created attachment 281 [details]
Transcript of the commands
Comment 6 Gunther 2015-04-30 14:36:00 CEST
Created attachment 282 [details]
log of unbound
Comment 7 Gunther 2015-04-30 14:39:11 CEST
I have attached the log with verbosity of 4 and the transcript of the commands which were run on the host where unbound runs. The password for the log is "unbound". /etc/resolv.conf on that machine contains 127.0.0.1 as a resolver, so all dig queries go to unbound.

Unfortunately, verbosity of 4 was set when the problem manifested itself, not before.
Comment 8 Wouter Wijngaards 2015-05-01 10:38:17 CEST
Hi Gunther,

Looking through the logs and transcript.

192.168.1.254 is configured as the place to forward queries to.  I guess 192.168.1.254 sometimes does not reply for a query for the A record.  And unbound caches this.

When I run these queries for an unbound instance that queries the authority servers directly, it seems to work all right.  The domain itself seems to be EDNS incapable, and it replies FORMERR to EDNS queries.

I guess it would be a good idea to start looking at server 192.168.1.254.

Best regards,
   Wouter
Comment 9 Gunther 2015-05-29 13:56:23 CEST
If the request to 192.168.1.254 times out, is it possible to not cache this and try until the reply is received?

I have read the man page for the configuration file from cover to cover and could not find anything that would allow this.
Comment 10 Wouter Wijngaards 2015-05-29 14:25:39 CEST
Hi Gunther,

Use prefetch: yes, then it will return the cached value, while it has several tries available to attempt to fetch the new value.

Best regards,
   Wouter
Comment 11 Gunther 2015-07-08 19:47:10 CEST
I have switched away from 192.168.1.254 as a forwarder (unbound does all of the querying itself now) and complaints stopped.

Is it possible to have a configuration option for this case? 
1. For example, if a request issued by unbound times out for a particular record, then unbound would cache this timeout for, say, 1 minute. 
2. If the parameter is set to 0, unbound will try to query every time the recursive request for this record is issued again against unbound and there is no "good" cached entry for the record.
3. Internally, the type of the cache entry should be 'timeout'. And unbound should time out too, for the recursive requests issued against unbound. Or it could be an option to answer with NXDOMAIN, SERVFAIL or just a timeout.
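
To make points 1 and 2 concrete, here is a minimal sketch of the behaviour being asked for. It is not unbound code and does not correspond to any existing unbound option; the class name `TimeoutCache`, its `ttl` parameter, and the injectable `clock` are all hypothetical.

```python
import time


class TimeoutCache:
    """Remembers upstream timeouts per (name, rrtype) for `ttl` seconds."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._expires = {}  # (name, rrtype) -> time the entry expires

    def record_timeout(self, name, rrtype):
        # With ttl == 0 nothing is stored, so every request retries upstream.
        if self.ttl > 0:
            self._expires[(name, rrtype)] = self.clock() + self.ttl

    def should_retry(self, name, rrtype):
        """Return True if the upstream should be queried again."""
        expiry = self._expires.get((name, rrtype))
        if expiry is None:
            return True
        if self.clock() >= expiry:
            # The cached timeout has aged out; forget it and retry.
            del self._expires[(name, rrtype)]
            return True
        return False
```

With `ttl=60`, a recorded timeout suppresses retries for one minute; with `ttl=0`, `should_retry` always returns True, which matches point 2.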
Comment 12 Wouter Wijngaards 2015-07-09 10:25:59 CEST
Hi Gunther,

I do not want to add spurious options; that bloats the codebase.

In this case, switching away from 192.168.1.254 seems to have fixed things.

The prefetch: yes option may really do what you want.  While still responding with the cached older entry, it will try to fetch it again.  The timing on it may allow unbound to attempt to fetch it multiple times; which is exactly what you want.

But I would not bother with the prefetch option if the current config is working for you.

Best regards, Wouter