Bug 4226 - Clients somehow confused into using AAAA records when using unbound forwarding
Status: RESOLVED INVALID
Product: unbound
Classification: Unclassified
Component: server
Version: 1.8.3
Hardware: x86_64 Linux
Importance: P5 enhancement
Assigned To: unbound team
Reported: 2019-02-08 06:40 CET by Ian Wienand
Modified: 2019-02-08 22:53 CET
Attachments
strace of the wget that incorrectly tries ipv6 connections (62.94 KB, text/plain)
2019-02-08 06:42 CET, Ian Wienand
strace of the wget that correctly tries ipv4 connections (105.32 KB, text/plain)
2019-02-08 06:43 CET, Ian Wienand
unbound debug log files for the requests (448.65 KB, text/plain)
2019-02-08 06:44 CET, Ian Wienand

Description Ian Wienand 2019-02-08 06:40:07 CET
First up, I know this seems like a resolver issue, not a DNS issue :)  But I can *only* replicate this with unbound as the forwarder.

This is on Fedora 29 hosts.  That's

---
Name         : unbound
Version      : 1.8.3
Release      : 2.fc29
Arch         : x86_64
---

What's actually happening is that a "dnf upgrade" early in our CI run bails out, but I can replicate what happens with a wget.  Running it twice in a row:

---
[root@fedora-29-vexxhost-vexxhost-sjc1-0002394941 ~]# strace -f -o /tmp/bad.txt wget 'https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f29&arch=x86_64'
--2019-02-08 03:32:15--  https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f29&arch=x86_64
Resolving mirrors.fedoraproject.org (mirrors.fedoraproject.org)... 2610:28:3090:3001:dead:beef:cafe:fed3, 2605:bc80:3010:600:dead:beef:cafe:feda, 2605:bc80:3010:600:dead:beef:cafe:fed9
Connecting to mirrors.fedoraproject.org (mirrors.fedoraproject.org)|2610:28:3090:3001:dead:beef:cafe:fed3|:443... failed: Network is unreachable.
Connecting to mirrors.fedoraproject.org (mirrors.fedoraproject.org)|2605:bc80:3010:600:dead:beef:cafe:feda|:443... failed: Network is unreachable.
Connecting to mirrors.fedoraproject.org (mirrors.fedoraproject.org)|2605:bc80:3010:600:dead:beef:cafe:fed9|:443... failed: Network is unreachable.

[root@fedora-29-vexxhost-vexxhost-sjc1-0002394941 ~]# strace -f -o /tmp/good.txt wget 'https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f29&arch=x86_64'
--2019-02-08 03:32:21--  https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f29&arch=x86_64
Resolving mirrors.fedoraproject.org (mirrors.fedoraproject.org)... 152.19.134.198, 140.211.169.196, 209.132.181.15, ...
Connecting to mirrors.fedoraproject.org (mirrors.fedoraproject.org)|152.19.134.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12112 (12K) [application/metalink+xml]
Saving to: ‘metalink?repo=updates-released-modular-f29&arch=x86_64.26’
---

The first time, it attempts to connect to the addresses from the AAAA responses, which it cannot reach because there is no ipv6 connectivity.  At some point it changes its mind and uses ipv4.  It's not deterministic; both can happen in a short amount of time.  I don't believe there is anything that would make it think there's ipv6:

---
[root@fedora-29-vexxhost-vexxhost-sjc1-0002394941 ~]# nmcli
ens3: connected to System ens3
        "Red Hat Virtio"
        ethernet (virtio_net), FA:16:3E:EC:6E:09, hw, mtu 1500
        ip4 default
        inet4 38.108.68.82/24
        route4 169.254.169.254/32
        route4 0.0.0.0/0
        route4 38.108.68.0/24
        inet6 fe80::f816:3eff:feec:6e09/64
        route6 ff00::/8
        route6 fe80::/64
---

I can *only* replicate this using unbound with our forwarding configuration:

---
forward-zone:
  name: "."
  forward-addr: 2620:0:ccc::2
  forward-addr: 2001:4860:4860::8888
  forward-addr: 208.67.222.222
  forward-addr: 8.8.8.8
---

If I just use 8.8.8.8 as the forwarder, I cannot get this to happen at all.

I have attached some files that can hopefully help in tracing this.

bad.txt is strace of above, when it tried to access AAAA records
good.txt is strace of above, when it worked with A records.  This was run almost immediately after bad.txt
unbound-good-then-bad.txt is the unbound log file for the period, beginning with the request made by bad.txt

My first thought was that it was something like #4188, where unbound was choosing ipv6 over ipv4 for forwarding requests, but it's not quite the same.  As mentioned, I know choosing between records is not unbound's job, but it only seems to happen when unbound is in between ...
Comment 1 Ian Wienand 2019-02-08 06:42:17 CET
Created attachment 561 [details]
strace of the wget that incorrectly tries ipv6 connections
Comment 2 Ian Wienand 2019-02-08 06:43:00 CET
Created attachment 562 [details]
strace of the wget that correctly tries ipv4 connections
Comment 3 Ian Wienand 2019-02-08 06:44:09 CET
Created attachment 563 [details]
unbound debug log files for the requests
Comment 4 Wouter Wijngaards 2019-02-08 07:54:02 CET
Hi Ian,

The problem is that server 208.67.222.222#53 does not respond with RRSIG signatures.  When the trace is bad, it sends the A query to that server, but receives a response without signatures.  The AAAA query (randomly) goes to 8.8.8.8, which responds with signatures.  The validator then marks the A response as bogus and it becomes servfail.  The AAAA response succeeds.

So for the client it looks like the ip4 lookup failed and the ip6 lookup succeeded.  And it tries to connect to ip6.  In reality, it seems to connect to both ip4 and ip6 at the same time, but since it did not get any ip4, you see the ip6 failure printed out.
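
The sequence above can be sketched as a small model. This is an illustration only, not unbound's code: the `returns_rrsig` flags are assumptions taken from the explanation above, the answers are placeholder documentation addresses rather than the real fedoraproject records, and `validated_lookup` and `client_view` are hypothetical names.

```python
# Hypothetical model of two of the forwarders. The flags are assumptions
# based on the analysis in this comment, not measured values.
UPSTREAMS = {
    "208.67.222.222": {"returns_rrsig": False},
    "8.8.8.8": {"returns_rrsig": True},
}

# Placeholder answers from the documentation ranges, not the real records.
ANSWERS = {"A": ["192.0.2.10"], "AAAA": ["2001:db8::10"]}


def validated_lookup(qtype, upstream):
    """Return (rcode, addresses) as a validating forwarder would hand them on."""
    if not UPSTREAMS[upstream]["returns_rrsig"]:
        # Signed zone but no signatures in the reply: treated as bogus.
        return "SERVFAIL", []
    return "NOERROR", ANSWERS[qtype]


def client_view(upstream_for_a, upstream_for_aaaa):
    """Collect the addresses a client ends up with after both lookups."""
    addresses = []
    for qtype, upstream in (("A", upstream_for_a), ("AAAA", upstream_for_aaaa)):
        _rcode, addrs = validated_lookup(qtype, upstream)
        addresses.extend(addrs)
    return addresses


# "Bad" run: the A query lands on the server that strips signatures.
bad = client_view("208.67.222.222", "8.8.8.8")
# "Good" run: both queries land on a server that returns signatures.
good = client_view("8.8.8.8", "8.8.8.8")
print(bad)
print(good)
```

In the "bad" case the client is left with only the IPv6 address, which matches the wget output in the description.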

So, remove the server that is failing to serve DNSSEC responses, 208.67.222.222 from the upstream list.  Or disable dnssec validation (don't do this one, instead use upstream servers that support DNSSEC).
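
One way to check a candidate upstream before adding it is to send a query with the EDNS0 DO bit set and see whether any RRSIG records come back. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, it only handles IPv4 servers over UDP, it ignores truncated replies, and it assumes the queried name lives in a signed zone.

```python
import socket
import struct

TYPE_A = 1
TYPE_OPT = 41   # EDNS0 pseudo-record type
TYPE_RRSIG = 46  # RRSIG record type (RFC 4034)


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"


def build_query(name, qtype=TYPE_A, query_id=0x1234):
    """Build a DNS query with RD set and an OPT record carrying the DO bit."""
    # Header: id, flags (RD), 1 question, 0 answers, 0 authority, 1 additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)
    # OPT: root name, type 41, UDP size 1232, ext-rcode 0, version 0,
    # flags 0x8000 (DO bit), empty rdata.
    opt = b"\x00" + struct.pack("!HHBBHH", TYPE_OPT, 1232, 0, 0, 0x8000, 0)
    return header + question + opt


def skip_name(msg, pos):
    """Return the offset just past the (possibly compressed) name at pos."""
    while True:
        length = msg[pos]
        if length == 0:
            return pos + 1
        if (length & 0xC0) == 0xC0:  # compression pointer is two bytes long
            return pos + 2
        pos += 1 + length


def has_rrsig(msg):
    """Report whether the answer section of a DNS reply contains an RRSIG."""
    qdcount, ancount = struct.unpack("!HH", msg[4:8])
    pos = 12
    for _ in range(qdcount):
        pos = skip_name(msg, pos) + 4  # skip qtype and qclass
    for _ in range(ancount):
        pos = skip_name(msg, pos)
        rtype, _rclass, _ttl, rdlen = struct.unpack("!HHIH", msg[pos:pos + 10])
        if rtype == TYPE_RRSIG:
            return True
        pos += 10 + rdlen
    return False


def upstream_returns_rrsig(server, name, timeout=3.0):
    """Send one DO-bit query to server:53 and check the reply for RRSIGs."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        reply, _addr = sock.recvfrom(4096)
    return has_rrsig(reply)
```

Calling `upstream_returns_rrsig("8.8.8.8", "fedoraproject.org")` and the same for each other forwarder would show which ones pass signatures through. The result depends on the live servers, so no expected output is given here.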

Unbound will retry and eventually recover from the failure, but it takes time.  Things start working later because, after some retries (that do not use the failing upstream server), unbound finds working data and puts that in the cache.
While this is happening, you see the failing lookup for the fedoraproject servers.

Best regards, Wouter
Comment 5 Wouter Wijngaards 2019-02-08 07:57:04 CET
Hi Ian,

You should then also remove the IPv6 address that corresponds to that failing server from the upstream server list, if you want things to start to work.  I do not know which one that is, but I guess one of the IPv6 addresses corresponds to the IPv4 one.  Alternatively, the server would somehow have to be fixed so that it returns the RRSIG records.

Best regards, Wouter
Comment 6 Ian Wienand 2019-02-08 22:53:05 CET
Many thanks for this response; it seems very clear once explained ;)  Now that I look, I see that OpenDNS removing dnssec records is a common topic in searches.  We have switched our dns forwarders [1].

[1] https://review.openstack.org/#/c/635895/