Bug 763 - Lack of DS records from forwarder should not cause long-term validation failures
Status: ASSIGNED
Product: unbound
Classification: Unclassified
Component: server
Version: 1.5.8
Hardware: x86_64 Linux
Importance: P5 major
Assigned To: unbound team
Reported: 2016-05-01 13:29 CEST by Simon Arlott
Modified: 2016-07-06 21:34 CEST (History)
Attachments
Packet capture of failing queries (2.07 KB, application/cap)
2016-07-02 11:42 CEST, Simon Arlott

Description Simon Arlott 2016-05-01 13:29:57 CEST
I don't have a packet capture for this; all I can assume is that the upstream resolver (unbound 1.4.22) has temporarily failed to resolve the DS record for some unknown reason (it doesn't log anything).

2016-05-01T11:54:29.841+01:00 unbound@zemphis unbound[1270]: [1270:1b] info: validation failure <github.com. AAAA IN>: no DNSSEC records from 81.2.80.94 for DS github.com. while building chain of trust
2016-05-01T11:54:29.841+01:00 unbound@zemphis unbound[1270]: [1270:15] info: validation failure <github.com. A IN>: no DNSSEC records from 2001:8b0:ffea:0:44d3:22ff:fe06:7dee for DS github.com. while building chain of trust

I don't want to make the upstream resolver ignore the CD flag as I want to be able to make CD queries on the downstream resolver.

This keeps happening to me with .com domains: the DS record somehow goes missing on the downstream resolver even though it's available on the upstream resolver.

I think the forwarder configuration needs an option to force multiple retries, because a forwarder could return a different answer on the next attempt, unlike a normal server that could be expected to be consistent.

I also don't want a validation failure based on missing records to be cached for more than a few seconds; the downstream resolver needs to retry the upstream resolver sooner. Let the upstream resolver decide when to give up.
Comment 1 Wouter Wijngaards 2016-05-02 09:09:35 CEST
Hi Simon,

If unbound is the upstream resolver: give it a trust anchor and it will attempt to fetch valid DNSSEC contents for everything (including +CD flag queries from downstream resolvers).  This includes 5 retries for bogus items, failover to DNSSEC-supporting nameservers, and a short cache lifetime for failed answers.

If unbound is the downstream resolver, it performs 5 retries for bogus items.  It also has a short cache lifetime for failed answers.  This is configurable with the config option val-bogus-ttl: 60 (the value is in seconds; the default is 60 seconds).

The upstream forwarder should really be able to resolve DNSSEC records correctly.  The .com zone is well hosted, and its DS records should work fine.  I guess it is likely that the problem is at the forwarder.  val-log-level: 2 makes it print out why it cannot get DNSSEC data.
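
In unbound.conf the two options mentioned above would look something like this (a sketch, assuming they are placed in the server: clause; 60 is the default named above and 2 is the log level suggested here):

```
server:
    # Cache lifetime, in seconds, for answers that failed validation (bogus).
    val-bogus-ttl: 60
    # Log validation failures together with the reason for the failure.
    val-log-level: 2
```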

(Is there some sort of middlebox stripping DNSSEC records away?  A firewall with a 512-byte size constraint?)  It sounds like there is a problem with the upstream forwarder installation.

Best regards, Wouter
Comment 2 Simon Arlott 2016-05-03 00:20:49 CEST
Both resolvers are running unbound (the downstream resolver is 1.5.8) and they both have the root trust anchor.
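
As a point of reference, a downstream forwarding setup of the kind described here could look roughly like the sketch below. The forwarder addresses are the ones that appear in the log lines in the description; the trust anchor path and the overall layout are assumptions, not the actual configuration from either machine.

```
server:
    # Root trust anchor (path is an assumption; it varies by distribution).
    auto-trust-anchor-file: "/var/lib/unbound/root.key"
    val-log-level: 2

forward-zone:
    # Send all queries to the upstream unbound resolver.
    name: "."
    forward-addr: 81.2.80.94
    forward-addr: 2001:8b0:ffea:0:44d3:22ff:fe06:7dee
```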

val-log-level is set to 2 on both servers but the upstream resolver doesn't log anything when this bug occurs.

There is nothing modifying DNS traffic. I'm now capturing all DNS traffic in and out of the upstream resolver to attempt to determine why this happens.
Comment 3 Simon Arlott 2016-07-02 11:42:29 CEST
Created attachment 333
Packet capture of failing queries

I've had an unbound resolver suddenly stop working for any query:
2016-07-02T10:09:06.078+01:00 unbound@tsort unbound[1348]: [1348:1] info: validation failure <vger.kernel.org. AAAA IN>: no signatures from 81.2.80.94 for key org. while building chain of trust
2016-07-02T10:09:06.251+01:00 unbound@tsort unbound[1348]: [1348:1] info: validation failure <vger.kernel.org. A IN>: key for validation org. is marked as invalid because of a previous validation failure <vger.kernel.org. AAAA IN>: no signatures from 81.2.80.94 for key org. while building chain of trust
2016-07-02T10:09:06.891+01:00 unbound@tsort unbound[1348]: [1348:1] info: validation failure <vger.kernel.org. MX IN>: key for validation org. is marked as invalid because of a previous validation failure <vger.kernel.org. AAAA IN>: no signatures from 81.2.80.94 for key org. while building chain of trust

I've got a packet capture of this query (dig . +dnssec +cd):
2016-07-02T10:28:22.380+01:00 unbound@tsort unbound[1348]: [1348:6] info: validation failure <. A IN>: no signatures from 81.2.80.94

Both servers are running 1.5.8-1ubuntu1. The client stops setting the DO flag and only recovers when I reload it.
Comment 4 Wouter Wijngaards 2016-07-06 11:06:30 CEST
Hi Simon,

You must be suffering from PMTU problems.  This is why you have intermittent issues.  The .org zone has a larger keyset (16xx bytes).

Unbound will immediately attempt to fall back to a shorter EDNS size when there is a timeout from the upstream servers.  But somehow this is not working, and unbound ends up with a non-EDNS response (with no signatures).

You should really solve your PMTU problems.  They mean that you cannot properly receive large responses from the upstream servers.  Fragments, firewalls, ...

Unbound also has options to work around it; edns-buffer-size: 1400 can perhaps solve the problems for you.  (If it also starts happening on IPv6, try 1200.)  Unbound will then fall back to using TCP (so TCP should work to authority servers).
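
A minimal sketch of that workaround in unbound.conf, assuming the option goes in the server: clause of the resolver that shows the failures (the values are the ones suggested above):

```
server:
    # Advertise a smaller EDNS buffer size so that replies fit in a single
    # unfragmented UDP packet; larger answers are then retried over TCP.
    edns-buffer-size: 1400
    # If the problem also shows up over IPv6, try 1200 instead.
```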

Best regards, Wouter
Comment 5 Simon Arlott 2016-07-06 13:50:44 CEST
(In reply to Wouter Wijngaards from comment #4)
> Hi Simon,
> 
> You must be suffering from PMTU problems.  This is why you have intermittent
> issues.  The .org has a larger keyset (16xx bytes).

I don't have PMTU problems.

# dig +short rs.dns-oarc.net txt
rst.x4050.rs.dns-oarc.net.
rst.x4060.x4050.rs.dns-oarc.net.
rst.x4066.x4060.x4050.rs.dns-oarc.net.
"2001:8b0:ffea:0:44d3:22ff:fe06:7dee DNS reply size limit is at least 4066"
"2001:8b0:ffea:0:44d3:22ff:fe06:7dee sent EDNS buffer size 4096"
"Tested at 2016-07-06 11:48:17 UTC"

# dig +short rs.dns-oarc.net txt
rst.x4050.rs.dns-oarc.net.
rst.x4058.x4050.rs.dns-oarc.net.
rst.x4064.x4058.x4050.rs.dns-oarc.net.
"81.2.80.94 DNS reply size limit is at least 4064"
"81.2.80.94 sent EDNS buffer size 4096"
"Tested at 2016-07-06 11:50:22 UTC"

> Unbound will immediately attempt to fallback to a shorter EDNS size, when
> there is a timeout from them.  But somehow this is not working; and unbound
> ends up with a non-EDNS response (with no signatures).

Why is the client failing to request DNSSEC data in any of its queries?

> You should really solve your PMTU problems.  This means, you cannot properly
> receive large responses from the upstream servers.  Fragments, firewalls, ...

Both servers are on the same network which accepts 1500 byte packets and fragments.
Comment 6 Wouter Wijngaards 2016-07-06 14:06:29 CEST
Hi Simon,

Unbound only stops asking for EDNS if EDNS fails somehow.  I.e. the server timed out, so unbound first fell back to a 1480 size and then to no EDNS.  If the server responds with NOTIMPL, SERVFAIL or FORMERR, unbound may also fall back to non-EDNS.

Sometimes routes are different?  Packets go somewhere else?  Firewalls are redundant and the second one?  The test is not the same as the path between you and the .org authority server, of course, so there could be something there.

The test is good, so this is going to be a difficult-to-find PMTU problem, or something else entirely, but I am not sure what.  Missing signatures together with the .org key size of 16xx bytes are a clear indication of PMTU trouble.

Best regards, Wouter
Comment 7 Wouter Wijngaards 2016-07-06 14:12:14 CEST
Another wild idea is that someone is using dnsdist or something similar that wrongly routes the upstream query to a server without DNSSEC support, which then responds without signatures.
Comment 8 Simon Arlott 2016-07-06 21:34:01 CEST
(In reply to Wouter Wijngaards from comment #6)
> Unbound only does not ask for EDNS if that fails somehow.  I.e. server
> timeouted first to 1480 size, then without EDNS.  If it responds with
> NOTIMPL or SERVFAIL or FORMERR it may fallback to nonEDNS.

Then there needs to be some way to specify in the configuration file that
the forwarder is perfect and must always use EDNS. The client resolver is
configured to require validated data, so querying the forwarder without
requesting DNSSEC RRs is pointless.
 
> Sometimes routes are different?  Packets go somewhere else?  Firewalls are
> redundant and the second one?  The test is not the same as the path between
> you and the .org authority server, of course, so there could be something
> there.

The unbound client is on a VM that is on the server which is running the
unbound server.

> The test is good, so this is going to be difficult to find PMTU problem, or
> something else entirely, but I am not sure what.  The signatures not present
> and .org keysize 16xx bytes is a clear PMTU trouble indication.

In this case the signatures are not present because the client isn't asking for
DNSSEC records...


(In reply to Wouter Wijngaards from comment #7)
> Another wild idea, is that someone is using dnsdist or some other thingy,
> and routes the upstream query wrongly to a non-DNSSEC supporting server,
> that responds without signatures.

There is no other intermediate server or upstream proxy. The unbound client
queries the unbound server and the unbound server directly queries the root
nameservers. My ISP does not do anything to my traffic.