Bugzilla – Bug 4226
Clients somehow confused into using AAAA records when using unbound forwarding
Last modified: 2019-02-08 22:53:05 CET
First up, I know this seems like a resolver issue, not a DNS issue :) But I can *only* replicate this with unbound as the forwarder.
This is on Fedora 29 hosts. That's
Name : unbound
Version : 1.8.3
Release : 2.fc29
Arch : x86_64
What's actually happening is a "dnf upgrade" early in our CI run bails out; but I can replicate what happens with a wget. Running twice in a row
[root@fedora-29-vexxhost-vexxhost-sjc1-0002394941 ~]# strace -f -o /tmp/bad.txt wget 'https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f29&arch=x86_64'
--2019-02-08 03:32:15-- https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f29&arch=x86_64
Resolving mirrors.fedoraproject.org (mirrors.fedoraproject.org)... 2610:28:3090:3001:dead:beef:cafe:fed3, 2605:bc80:3010:600:dead:beef:cafe:feda, 2605:bc80:3010:600:dead:beef:cafe:fed9
Connecting to mirrors.fedoraproject.org (mirrors.fedoraproject.org)|2610:28:3090:3001:dead:beef:cafe:fed3|:443... failed: Network is unreachable.
Connecting to mirrors.fedoraproject.org (mirrors.fedoraproject.org)|2605:bc80:3010:600:dead:beef:cafe:feda|:443... failed: Network is unreachable.
Connecting to mirrors.fedoraproject.org (mirrors.fedoraproject.org)|2605:bc80:3010:600:dead:beef:cafe:fed9|:443... failed: Network is unreachable.
[root@fedora-29-vexxhost-vexxhost-sjc1-0002394941 ~]# strace -f -o /tmp/good.txt wget 'https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f29&arch=x86_64'
--2019-02-08 03:32:21-- https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f29&arch=x86_64
Resolving mirrors.fedoraproject.org (mirrors.fedoraproject.org)... 220.127.116.11, 18.104.22.168, 22.214.171.124, ...
Connecting to mirrors.fedoraproject.org (mirrors.fedoraproject.org)|126.96.36.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12112 (12K) [application/metalink+xml]
Saving to: ‘metalink?repo=updates-released-modular-f29&arch=x86_64.26’
The first time, it attempts to connect to the AAAA responses, which fails because there's no ipv6 connectivity. At some point it changes its mind and uses ipv4. It's not deterministic; both can happen in a short amount of time. I don't believe there is anything that would make it think there's ipv6:
[root@fedora-29-vexxhost-vexxhost-sjc1-0002394941 ~]# nmcli
ens3: connected to System ens3
"Red Hat Virtio"
ethernet (virtio_net), FA:16:3E:EC:6E:09, hw, mtu 1500
I can *only* replicate this using unbound with our forwarding configuration:
If I just use 188.8.131.52 I cannot get this to happen at all.
I have attached some files that can hopefully help in tracing this.
bad.txt is strace of above, when it tried to access AAAA records
good.txt is strace of above, when it worked with A records. This was run almost immediately after bad.txt
unbound-good-then-bad.txt is the unbound log file for the period, beginning with the request made by bad.txt
My first thought was that it was something like #4188, where unbound was choosing ipv6 over ipv4 for forwarding requests, but it's not quite the same. As mentioned, I know choosing between records is not unbound's job, but it only seems to happen when unbound is in between ...
Created attachment 561 [details]
strace of the wget that incorrectly tries ipv6 connections
Created attachment 562 [details]
strace of the wget that correctly tries ipv4 connections
Created attachment 563 [details]
unbound debug log files for the requests
The problem is that server 184.108.40.206#53 does not respond with RRSIG signatures. In the bad trace, unbound sends the A query to that server but receives a response without signatures. The AAAA query (randomly) goes to 220.127.116.11, which responds with signatures. The validator then marks the A response as bogus, so it becomes servfail, while the AAAA response succeeds.
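To illustrate why the failure comes and goes, here is a minimal Python sketch of the behaviour described above. It is a toy model, not unbound's actual server selection or validator: the upstream labels are hypothetical stand-ins for the two kinds of server in the report, selection is assumed to be uniformly random per query, and retries and caching are ignored.

```python
import random

# Hypothetical labels for the two kinds of upstream in the report:
# one strips RRSIG records, the other returns them.
UPSTREAMS = {
    "strips-rrsig": False,
    "keeps-rrsig": True,
}


def forward_and_validate(qtype, rng):
    """Send one query to a randomly chosen upstream and validate the answer.

    Assumes the queried zone is signed, so an answer without RRSIG
    records is treated as bogus and reported to the client as SERVFAIL.
    """
    upstream = rng.choice(sorted(UPSTREAMS))
    has_rrsig = UPSTREAMS[upstream]
    rcode = "NOERROR" if has_rrsig else "SERVFAIL"
    return qtype, upstream, rcode


def lookup_host(rng):
    """Resolve A and AAAA independently, as a dual-stack client would."""
    return {qtype: forward_and_validate(qtype, rng)[2] for qtype in ("A", "AAAA")}


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        # Results vary between runs because each query picks its own upstream.
        print(lookup_host(rng))
```

In this model the "bad" run is the case where the A query lands on the stripping upstream and the AAAA query does not; the "good" run is the case where both land on the upstream that keeps signatures.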
So for the client it looks like the ip4 lookup failed and the ip6 lookup succeeded. And it tries to connect to ip6. In reality, it seems to connect to both ip4 and ip6 at the same time, but since it did not get any ip4, you see the ip6 failure printed out.
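A small sketch of the client's point of view, assuming the client simply collects addresses from whichever per-family lookups succeeded. The function name and the addresses (taken from the documentation ranges) are illustrative only, and real resolver libraries also sort and filter addresses, which this ignores.

```python
def addresses_to_try(a_result, aaaa_result):
    """Collect candidate addresses from the A and AAAA lookups.

    Each result is a (rcode, addresses) pair. A lookup that did not
    return NOERROR contributes nothing to the candidate list.
    """
    candidates = []
    for rcode, addrs in (a_result, aaaa_result):
        if rcode == "NOERROR":
            candidates.extend(addrs)
    return candidates


# Placeholder addresses from the documentation ranges, not the real mirrors.
V4 = ["192.0.2.1"]
V6 = ["2001:db8::1"]

if __name__ == "__main__":
    # Bad case: the A lookup was marked bogus, so only IPv6 candidates remain.
    print(addresses_to_try(("SERVFAIL", []), ("NOERROR", V6)))
    # Good case: both lookups succeeded, so IPv4 candidates are available.
    print(addresses_to_try(("NOERROR", V4), ("NOERROR", V6)))
```

On a host without IPv6 connectivity, the first case leaves the client with nothing it can actually reach, which matches the "Network is unreachable" failures in the report.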
So, remove the server that is failing to serve DNSSEC responses, 18.104.22.168, from the upstream list. Or disable dnssec validation (don't do this one; instead, use upstream servers that support DNSSEC).
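One way to find out which upstream is stripping signatures is to query each one directly with `dig +dnssec` and look for RRSIG records in the answer. The following Python sketch wraps that check; the helper names are hypothetical, it assumes `dig` is installed, and it is only meaningful for names in a signed zone.

```python
import subprocess


def answer_has_rrsig(dig_output):
    """Return True if any record line in dig's text output has type RRSIG."""
    for line in dig_output.splitlines():
        # Skip blank lines and dig's ';' comment lines.
        if not line.strip() or line.startswith(";"):
            continue
        fields = line.split()
        # Record lines have the form: name TTL class type rdata...
        if len(fields) >= 4 and fields[3] == "RRSIG":
            return True
    return False


def upstream_returns_rrsig(server, name, qtype="A"):
    """Query one upstream directly with the DO bit set and check for RRSIGs."""
    result = subprocess.run(
        ["dig", "+dnssec", "+noall", "+answer", f"@{server}", name, qtype],
        capture_output=True,
        text=True,
        timeout=10,
        check=True,
    )
    return answer_has_rrsig(result.stdout)
```

For example, calling `upstream_returns_rrsig(server, "mirrors.fedoraproject.org")` for each forwarder address in the configuration should return `False` for any server that drops the signatures.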
Unbound will retry and eventually recover from the failure, but it takes time. Things start working later because, after some retries that do not use the failing upstream server, unbound finds working data and puts that in the cache.
While this is happening, you see the failing lookup for the fedoraproject servers.
Best regards, Wouter
You should then also remove the IPv6 address that corresponds to that failing server from the upstream server list, if you want things to start to work. I do not know which one that is, but I guess one of the IPv6 addresses corresponds to the IPv4. Alternatively, the server would somehow have to be fixed to return the RRSIG records.
Best regards, Wouter
Many thanks for this response; it seems very clear once explained ;) Now that I look, I see OpenDNS removing dnssec records is a common topic in searches. We have switched our dns forwarders.