Bug 4225 - Clients seem to often erroneously receive NXDOMAIN when querying with DNS-over-TLS on 1.9.0
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
Version: 1.9.0
Hardware: x86_64 Linux
Importance: P5 major
Assigned To: unbound team
Reported: 2019-02-07 05:48 CET by Michael Marley
Modified: 2019-02-08 16:56 CET

Attachments
Unbound configuration (2.61 KB, application/octet-stream)
2019-02-07 05:48 CET, Michael Marley
Log output from a reproduction of the issue while looking up "reddit.com" (184.27 KB, text/plain)
2019-02-07 05:50 CET, Michael Marley
Log output from a reproduction of the issue while looking up "ppa.launchpad.net" with verbosity 5 (821.02 KB, text/plain)
2019-02-08 12:54 CET, Michael Marley

Description Michael Marley 2019-02-07 05:48:38 CET
Created attachment 559 [details]
Unbound configuration

After upgrading Unbound to 1.9.0, I have found that clients using DNS-over-TLS often seem to erroneously act as if a DNS query returned NXDOMAIN even when it did not.  This is new in 1.9.0 and did not happen with 1.8.x.

-I have observed the issue both on a desktop Linux system running systemd-resolved with DNS-over-TLS enabled and on an Android 9 device, though it happens significantly more often on the Android device.

-It never happens if I turn DNS-over-TLS off on the client device.

-It seems to happen only for IPv4 A queries, or at least I have never seen it happen for an IPv6 AAAA query.

-Neither TLS session resumption nor TCP Fast Open seems to have anything to do with the problem; I can disable both and the problem still occurs.

-It seems to happen more often for certain sites: the IPv4 side of ipv6-test.com (v4.ipv6-test.com), reddit.com, and ppa.launchpad.net all seem to fail more often than other domains.

I have attached the unbound.conf I am using and the log output from a failed lookup of reddit.com from an Android 9 device with DNS-over-TLS enabled, with the log verbosity increased to 9.  If there is anything else I can provide, please let me know.
Comment 1 Michael Marley 2019-02-07 05:50:17 CET
Created attachment 560 [details]
Log output from a reproduction of the issue while looking up "reddit.com"
Comment 2 Wouter Wijngaards 2019-02-07 09:22:14 CET
Hi Michael,

Reddit.com is a CNAME to reddit.map.fastly.net, but there is no AAAA record for reddit.map.fastly.net.  There are no NXDOMAIN or SERVFAIL or DNSSEC failures in that log.

TLS requests are treated the same as non-TLS by Unbound.

Can you do lookups with dig from the client device, and also over TLS from the client device with a command-line tool?  For example, streamtcp (from the Unbound source tree) is a tool that can do a DNS-over-TLS query and print the output.  That can tell whether the problem is on the server or the client.  Oh, I realise this is an Android device and that may not be possible.

But you could make queries from another client, and that will tell what Unbound is answering, also over TLS.
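
For example, from another machine, something like this (check streamtcp -h for the exact flags; the address and port here are placeholders):

streamtcp -s -f <address-of-unbound>@853 reddit.com A IN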

Best regards, Wouter
Comment 3 Michael Marley 2019-02-07 13:49:17 CET
Thanks for the quick response!

I had already tried fairly extensively to reproduce it with command-line tools like "dig" and "nslookup" with no success (the lookups never failed).  I hadn't tried (or heard of) "streamtcp", but I compiled it and also wrote a script to run it in a tight loop, quitting only if it returns a non-successful exit code.  I ran this script for all the domains I was having trouble with, both separately and all at the same time, and was still unable to reproduce any failures.  However, I used the Android phone to visit one of the sites while this script was running and still got the "Server not found" message from the browser.  I did try using "adb logcat" to see if Android was printing any useful kind of log message, but it is not.  Also, in /var/log/syslog of the desktop Linux systems with systemd-resolved, I get a lot of "Server returned error NXDOMAIN, mitigating potential DNS violation DVE-2018-0001, retrying transaction with reduced feature level UDP+EDNS0."  This is the result of an Ubuntu-specific patch that makes resolved try again with UDP if it gets NXDOMAIN over TCP, but it does seem to indicate that resolved is seeing NXDOMAIN, which is consistent with the "Server not found" message in the browser on Android.  I also suspect this is why the problem is easier to reproduce on Android, since it isn't using any kind of retry logic like that.
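
A minimal sketch of the kind of loop I used (the streamtcp flags and the server address are placeholders; adjust them for your setup):

#!/bin/sh
# repeat the DNS-over-TLS lookup until streamtcp exits non-zero
while streamtcp -s -f 127.0.0.1@853 ppa.launchpad.net A IN >/dev/null; do
        :
done
echo "streamtcp reported a failure"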

I had initially thought that this might be a client-side issue, but the fact that going back to unbound 1.8.x makes it go away and that two completely different pieces of client software are encountering exactly the same problem makes it seem like the server might be at fault.  On the other hand, the fact that I can't reproduce it with any command-line tools indicates that the clients may be at fault.  I am befuddled now.  I will keep trying to reproduce though.
Comment 4 Wouter Wijngaards 2019-02-07 14:10:55 CET
Hi Michael,

Can you retry without the subnetcache enabled?  It could (potentially) give different answers to different source addresses.  But I guess those devices are all on the same subnet to start with, so that doesn't matter.

The Ubuntu machine should be able to give you the network traffic, captured with something like a network analyzer, and with the TLS certificate from the server you should be able to read the transcript of the packets that passed over the wire, and see what query was sent and what answer was received.  The same goes for the Android device, but that needs the network analyzer on the server; I believe that with the server's certificate the analyzer can decode it for you, and then you can see what was sent.

You can also enable Unbound's log-queries and log-replies.  They print the client IP address, the query name, and the response code, so an NXDOMAIN should be visible.  This is printed in the server logs.  Then you can see what the client is querying and what Unbound's answer is.
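For example, in the server: section of unbound.conf:

        log-queries: yes
        log-replies: yes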

Are they even querying the correct machine?  Perhaps they send queries to 8.8.8.8 and get public information instead of your local names, e.g. NXDOMAIN would be correct according to that server.  log-queries and log-replies could also help here.

But it is odd that 1.8.x does not show the problem.  Is there any configuration that is different?

Best regards, Wouter
Comment 5 Michael Marley 2019-02-07 16:07:28 CET
All the devices in question are indeed on the same subnet.

My initial attempts to get TLS decryption working with tcpdump failed because newer forward-secrecy ciphers are in use that cannot be MITMed in this way.  I guess I could play around with the newly added cipher-list setting in 1.9.0 and disable those, but I haven't gotten that far yet.

I tried log-queries and log-replies and got the following for an attempt to visit ipv6-test.com on Android.  The first 8 lines are from the initial, unsuccessful attempt to load the IPv4 information pane at the top.  The last two are from after I clicked the refresh button at the top-right of the pane and it loaded successfully.

michaelmarley unbound: [11668:0] query: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com. AAAA IN
michaelmarley unbound: [11668:0] reply: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com. AAAA IN NOERROR 0.132399 0 103
michaelmarley unbound: [11668:0] query: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com. A IN
michaelmarley unbound: [11668:0] reply: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com. A IN NOERROR 0.000000 0 61
michaelmarley unbound: [11668:0] query: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com.lan. AAAA IN
michaelmarley unbound: [11668:0] reply: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com.lan. AAAA IN NXDOMAIN 0.000568 0 49
michaelmarley unbound: [11668:0] query: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com.lan. A IN
michaelmarley unbound: [11668:0] reply: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com.lan. A IN NXDOMAIN 0.000499 0 49

michaelmarley unbound: [11668:0] query: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com. A IN
michaelmarley unbound: [11668:0] reply: fdda:5f29:421b:3:8871:923b:529a:2436 v4.ipv6-test.com. A IN NOERROR 0.000000 1 61

It seems to be sending back a non-error response for the "v4.ipv6-test.com. A IN" request, but I guess the client doesn't like whatever it is, because it then repeats the query with the ".lan" name appended (which is expected, since .lan is in my DNS search list).  The fact that the initial lookup has a time of 0 and a cache value of 0 seems rather fishy to me, since presumably it would take measurable time to do a lookup if the cache were cold.
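
(For reference, as I understand the log-replies format, the fields after the query name, type, and class are the rcode, the time taken to resolve, whether the answer came from the cache, and the reply size in bytes; so "NOERROR 0.000000 0 61" means NOERROR, resolved in 0 seconds, not from the cache, 61 bytes.)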

Yes, I'm quite sure all the clients are using only the local Unbound instance for DNS.

The only difference in configuration I had between 1.8 and 1.9 was turning on TLS session resumption in 1.9, but I already tried again with that off to ensure that it wasn't the problem.
Comment 6 Wouter Wijngaards 2019-02-07 16:26:33 CET
Hi Michael,

So the NXDOMAIN and page not found refer to the .lan response.

But the actual failure is that the response for the plain query is ignored and then it moves to the .lan entry instead?  I also do not know how the search list interacts with the sequence of finding a site; that is platform-specific and handled on the client.

The response for the plain query seems like it is the same as when you query from the commandline.  Is there something wrong with it?

You could disable the search list temporarily; that might remove the .lan queries, and then maybe you get to see the error for the first lookup.

Best regards, Wouter
Comment 7 Michael Marley 2019-02-07 20:21:23 CET
I tried disabling the search list and this makes it MUCH harder to reproduce on Android, but I still think there must be something defective about that first response, because I can still occasionally reproduce on desktop systems by running "sudo apt update" and getting "W: Failed to fetch <url>: Could not resolve 'ppa.launchpad.net'".  Here is a log of such a failed occurrence:

michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.archive.canonical.com. SRV IN
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.security.ubuntu.com. SRV IN
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.dl.google.com. SRV IN
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.us.archive.ubuntu.com. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.dl.google.com. SRV IN NXDOMAIN 0.082438 0 103
michaelmarley unbound: [11668:0] query: 10.3.148.85 _http._tcp.dl.google.com. SRV IN
michaelmarley unbound: [11668:0] reply: 10.3.148.85 _http._tcp.dl.google.com. SRV IN NXDOMAIN 0.000000 1 103
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.dl.google.com. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.dl.google.com. SRV IN NXDOMAIN 0.000000 1 103
michaelmarley unbound: [11668:0] query: 10.3.148.85 _http._tcp.dl.google.com. SRV IN
michaelmarley unbound: [11668:0] reply: 10.3.148.85 _http._tcp.dl.google.com. SRV IN NXDOMAIN 0.000000 1 103
michaelmarley unbound: [11668:1] query: 10.3.148.85 dl.google.com. A IN
michaelmarley unbound: [11668:1] query: 10.3.148.85 dl.google.com. AAAA IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 dl.google.com. A IN NOERROR 0.055943 0 125
michaelmarley unbound: [11668:1] reply: 10.3.148.85 dl.google.com. AAAA IN NOERROR 0.085981 0 89
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.security.ubuntu.com. SRV IN NXDOMAIN 0.207705 0 120
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.security.ubuntu.com. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.security.ubuntu.com. SRV IN NXDOMAIN 0.000000 1 120
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.security.ubuntu.com. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.security.ubuntu.com. SRV IN NXDOMAIN 0.000000 1 120
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.security.ubuntu.com. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.security.ubuntu.com. SRV IN NXDOMAIN 0.000000 1 120
michaelmarley unbound: [11668:1] query: 10.3.148.85 security.ubuntu.com. AAAA IN
michaelmarley unbound: [11668:1] query: 10.3.148.85 security.ubuntu.com. A IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 security.ubuntu.com. A IN NOERROR 0.000000 0 144
michaelmarley unbound: [11668:1] reply: 10.3.148.85 security.ubuntu.com. AAAA IN NOERROR 0.057743 0 216
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN NXDOMAIN 0.344387 0 121
michaelmarley unbound: [11668:0] query: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN
michaelmarley unbound: [11668:0] reply: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN NXDOMAIN 0.000000 1 121
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN NXDOMAIN 0.000000 1 121
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN NXDOMAIN 0.000000 1 121
michaelmarley unbound: [11668:1] query: 10.3.148.85 ppa.launchpad.net. A IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 ppa.launchpad.net. A IN NOERROR 0.000000 0 62
michaelmarley unbound: [11668:1] query: 10.3.148.85 ppa.launchpad.net. AAAA IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.us.archive.ubuntu.com. SRV IN NXDOMAIN 0.397214 0 122
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.us.archive.ubuntu.com. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.us.archive.ubuntu.com. SRV IN NXDOMAIN 0.000000 1 122
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.us.archive.ubuntu.com. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.us.archive.ubuntu.com. SRV IN NXDOMAIN 0.000000 1 122
michaelmarley unbound: [11668:0] query: 10.3.148.85 _http._tcp.us.archive.ubuntu.com. SRV IN
michaelmarley unbound: [11668:0] reply: 10.3.148.85 _http._tcp.us.archive.ubuntu.com. SRV IN NXDOMAIN 0.000000 1 122
michaelmarley unbound: [11668:1] query: 10.3.148.85 us.archive.ubuntu.com. A IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 us.archive.ubuntu.com. A IN NOERROR 0.000000 0 82
michaelmarley unbound: [11668:1] query: 10.3.148.85 us.archive.ubuntu.com. AAAA IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.archive.canonical.com. SRV IN NOERROR 0.409473 0 112
michaelmarley unbound: [11668:1] query: 10.3.148.85 archive.canonical.com. AAAA IN
michaelmarley unbound: [11668:1] query: 10.3.148.85 archive.canonical.com. A IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 archive.canonical.com. A IN NOERROR 0.000000 0 98
michaelmarley unbound: [11668:1] reply: 10.3.148.85 ppa.launchpad.net. AAAA IN NOERROR 0.186583 0 110
michaelmarley unbound: [11668:1] query: 10.3.148.85 ppa.launchpad.net. A IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 ppa.launchpad.net. A IN NOERROR 0.000000 0 62
michaelmarley unbound: [11668:1] query: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN
michaelmarley unbound: [11668:1] reply: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN NXDOMAIN 0.000000 1 121
<previous 2 lines repeat numerous times>
michaelmarley unbound: [11668:1] reply: 10.3.148.85 archive.canonical.com. AAAA IN NOERROR 0.124268 0 134
michaelmarley unbound: [11668:1] reply: 10.3.148.85 us.archive.ubuntu.com. AAAA IN NOERROR 0.296041 0 106

It never says it is returning anything other than NOERROR to the A and AAAA requests, yet the client is reporting that it couldn't resolve the name.  I am again seeing the 0-time initial lookup for ppa.launchpad.net, even though the cache flag is also 0.  I also think it is worth mentioning again that I have never once seen this happen for AAAA records; it happens for names with only A records.  It still seems to me that Unbound is returning something incorrect that is causing the client to reject the response for some reason, which is also what triggers the second lookup under .lan when the DNS search list (DNSSL) is enabled.
Comment 8 Michael Marley 2019-02-07 21:50:38 CET
I have finally found a way to reproduce this 100% of the time!:

-Clear Unbound's cache ("unbound-control flush_zone .")
-Restart systemd-resolved on the client (to clear its cache)
-Run "sudo apt update"

When I do this, I get the "Could not resolve" error for ppa.launchpad.net every time.  Repeating the "sudo apt update" without clearing both caches works as expected.
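
A consolidated sketch of those steps (assuming the usual systemd-resolved unit name):

unbound-control flush_zone .               # clear Unbound's cache
sudo systemctl restart systemd-resolved    # clear the client-side cache
sudo apt update                            # fails with "Could not resolve 'ppa.launchpad.net'"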

I can also:

-Clear Unbound's cache ("unbound-control flush_zone .")
-Restart systemd-resolved on the client (to clear its cache)
-Lookup "ppa.launchpad.net" with streamtcp (which succeeds)
-Restart systemd-resolved on the client (to clear its cache)
-Run "sudo apt update"

and I get the same problem on the APT update.  The relevant Unbound log for this is:

Feb  7 15:15:34 michaelmarley unbound: [11668:1] query: fdda:5f29:421b:3:cc0c:8d3b:70a5:fb53 ppa.launchpad.net. A IN
Feb  7 15:15:35 michaelmarley unbound: [11668:1] reply: fdda:5f29:421b:3:cc0c:8d3b:70a5:fb53 ppa.launchpad.net. A IN NOERROR 0.610883 0 62

Feb  7 15:15:45 michaelmarley unbound: [11668:0] query: 10.3.148.85 ppa.launchpad.net. A IN
Feb  7 15:15:45 michaelmarley unbound: [11668:0] reply: 10.3.148.85 ppa.launchpad.net. A IN NOERROR 0.000000 1 62

Feb  7 15:15:55 michaelmarley unbound: [11668:1] query: 10.3.148.85 ppa.launchpad.net. AAAA IN
Feb  7 15:15:55 michaelmarley unbound: [11668:1] query: 10.3.148.85 ppa.launchpad.net. A IN
Feb  7 15:15:55 michaelmarley unbound: [11668:1] reply: 10.3.148.85 ppa.launchpad.net. A IN NOERROR 0.000000 0 62
Feb  7 15:15:55 michaelmarley unbound: [11668:1] reply: 10.3.148.85 ppa.launchpad.net. AAAA IN NOERROR 0.127663 0 110
Feb  7 15:15:55 michaelmarley unbound: [11668:1] query: 10.3.148.85 ppa.launchpad.net. A IN
Feb  7 15:15:55 michaelmarley unbound: [11668:1] reply: 10.3.148.85 ppa.launchpad.net. A IN NOERROR 0.000000 0 62

The first section is from when I used streamtcp to do the lookup, and it shows a non-zero lookup time and a cold cache (as expected).  The second section shows the lookup from dig, which succeeds, shows the correct address, and indicates that the server it queried was the systemd-resolved instance on that box.  The lookup time is 0 and the cache is hot, as expected.  The third section is from when I ran "sudo apt update", which failed, and the timing shows 0 lookup time and a cold cache, which is fishy to me.

None of this happened on 1.8.x, so something must have changed in 1.9.0 to precipitate this.  I'm still trying to work out exactly why the result is different between dig through resolved and "apt update" through resolved.
Comment 9 Michael Marley 2019-02-07 22:40:33 CET
Since something seemed off with caching, I tried disabling the message and rrset caches (setting their size to 0) and I found that in both cases the problem is worked around.  I think the cache might be the culprit.
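
(The relevant unbound.conf options, for reference, are roughly:

        msg-cache-size: 0
        rrset-cache-size: 0
)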
Comment 10 Wouter Wijngaards 2019-02-08 07:32:35 CET
Hi Michael,

Since turning the cache off works around it, and you get NXDOMAIN, perhaps it is the new default for harden-below-nxdomain.  These defaults date from 1.8.0, not 1.9.0, but it sounds like this might be it.  Also, the names you report trouble with are CDN names with multiple labels.  What can then be the problem is that the CDN reports NXDOMAIN for the intermediate labels.  That is a flaw on their side; Unbound sees it because of qname minimisation.  And then harden-below-nxdomain builds a cache response (a 0-time cached NXDOMAIN response) from that NXDOMAIN for the intermediate label.  (It should have been an empty non-terminal response from the CDN servers, what is called a NODATA/NOERROR response.)  But some CDNs have wrong scripting for this.  We had some bugs around it before, so the code could have changed there, perhaps.
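
To illustrate roughly (a sketch, not the exact sequence Unbound sends): for the _http._tcp.ppa.launchpad.net SRV lookup, qname minimisation asks the launchpad.net servers about the intermediate names first,

        ppa.launchpad.net.              exists, NOERROR
        _tcp.ppa.launchpad.net.         empty non-terminal; the CDN answers NXDOMAIN instead of NOERROR/NODATA
        _http._tcp.ppa.launchpad.net.   the full SRV query

and harden-below-nxdomain can then answer names below _tcp.ppa.launchpad.net from that cached NXDOMAIN.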

If you can reproduce the problem, can you make a verbosity 5 log of unbound with the problem lookups?

Another thing to do, which might fix the problem, is these options:
        # qname-minimisation: yes
        # qname-minimisation-strict: no
        # harden-below-nxdomain: yes

These are the default values; set them to "no", I think that could perhaps fix the issue.  (The qname-minimisation-strict option could also be the cause of this, if you had it enabled.)
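
That is, in the server: section of unbound.conf:

        qname-minimisation: no
        qname-minimisation-strict: no
        harden-below-nxdomain: no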

Best regards, Wouter
Comment 11 Wouter Wijngaards 2019-02-08 07:34:23 CET
Hi Michael,

And also, perhaps 8.8.8.8 gives different answers for those intermediate labels, and then you do not have the NXDOMAINs that harden-below-nxdomain makes the problematic cache response from.  And the other ones give the NXDOMAIN (repeated from the CDN server).

Best regards, Wouter
Comment 12 Wouter Wijngaards 2019-02-08 07:41:52 CET
Hi Michael,

Wait, that last comment about 8.8.8.8 is for bug #4226, not this one.  I still think that harden-below-nxdomain could create the problem somehow, with the troublesome NXDOMAIN records getting into Unbound's cache from upstream somehow (and the CDN maybe giving the bad response all the time, or perhaps only in a percentage of cases, or something like that?).

Best regards, Wouter
Comment 13 Michael Marley 2019-02-08 12:53:11 CET
Hi Wouter,

I have done some more testing based on your suggestions and found that disabling qname-minimisation does work around the problem, but disabling harden-below-nxdomain does not work around the problem.  I never had qname-minimisation-strict turned on.

I also got the log of a failed lookup at verbosity 5 as you requested.  The name that failed to be looked up was "ppa.launchpad.net".
Comment 14 Michael Marley 2019-02-08 12:54:04 CET
Created attachment 564 [details]
Log output from a reproduction of the issue while looking up "ppa.launchpad.net" with verbosity 5
Comment 15 Wouter Wijngaards 2019-02-08 13:44:30 CET
Hi Michael,

This is wrong:
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: response for _http._tcp.ppa.launchpad.net. SRV IN
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: reply from <launchpad.net.> 91.189.94.173#53
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 0#012;; flags: qr aa ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0 #012;; QUESTION SECTION:#012_tcp.ppa.launchpad.net.#011IN#011A#012#012;; ANSWER SECTION:#012#012;; AUTHORITY SECTION:#012launchpad.net.#0113600#011IN#011SOA#011ns1.canonical.com. hostmaster.canonical.com. 2018030238 10800 3600 604800 3600#012#012;; ADDITIONAL SECTION:#012;; MSG SIZE  rcvd: 104
Feb  8 06:47:21 michaelmarley unbound: [14301:1] debug: iter_handle processing q with state QUERY RESPONSE STATE
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: query response was NXDOMAIN ANSWER

It returns NXDOMAIN for the query _tcp.ppa.launchpad.net IN A, and server 91.189.94.173 sends that NXDOMAIN.  It is not true, and I think it is a failure by the CDN owner, or the server software, that makes empty non-terminals NXDOMAIN instead of NOERROR/NODATA answers.
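(For comparison, a correct empty-non-terminal answer would look like the scrubbed packet above, with the same launchpad.net SOA in the authority section and ANSWER: 0, but with rcode NOERROR instead of NXDOMAIN; only the rcode differs.)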

The qname minimisation has fallbacks for this, and that activates.
But this fallback response is also NXDOMAIN.

Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: response for _http._tcp.ppa.launchpad.net. SRV IN
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: reply from <launchpad.net.> 91.189.94.173#53
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 0#012;; flags: qr aa ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0 #012;; QUESTION SECTION:#012_http._tcp.ppa.launchpad.net.#011IN#011SRV#012#012;; ANSWER SECTION:#012#012;; AUTHORITY SECTION:#012launchpad.net.#0113600#011IN#011SOA#011ns1.canonical.com. hostmaster.canonical.com. 2018030238 10800 3600 604800 3600#012#012;; ADDITIONAL SECTION:#012;; MSG SIZE  rcvd: 110
Feb  8 06:47:21 michaelmarley unbound: [14301:1] debug: iter_handle processing q with state QUERY RESPONSE STATE
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: query response was NXDOMAIN ANSWER

And this is also NXDOMAIN.  And this turns into the reply:
Feb  8 06:47:21 michaelmarley unbound: [14301:1] reply: 10.3.148.85 _http._tcp.ppa.launchpad.net. SRV IN NXDOMAIN 0.404268 0 121

After a couple of repeats, answered from cache, it then gets a query for ppa.launchpad.net A, and that starts to get resolved.

It then has a cache entry from the qname minimisation lookups, and that response for ppa.launchpad.net A contains no addresses.  This is inspected by the validator and then turns into this reply line:
Feb  8 06:47:21 michaelmarley unbound: [14301:1] reply: 10.3.148.85 ppa.launchpad.net. A IN NOERROR 0.000000 0 62

But without an address you could not connect to it.

And then it tries the AAAA for ppa.launchpad.net.  First the qname minimisation tries to get a referral with query type A (to hide the real query type for privacy, in case there is a referral).  There is no referral there, and the answer contains an actual A record, an IPv4 address!
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: response for ppa.launchpad.net. AAAA IN
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: reply from <launchpad.net.> 91.189.94.173#53
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0#012;; flags: qr aa ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0 #012;; QUESTION SECTION:#012ppa.launchpad.net.#011IN#011A#012#012;; ANSWER SECTION:#012ppa.launchpad.net.#011600#011IN#011A#01191.189.95.83#012#012;; AUTHORITY SECTION:#012#012;; ADDITIONAL SECTION:#012;; MSG SIZE  rcvd: 51

It then does not use that, but goes on to try the AAAA because there is no referral, and then receives the answer that there is no IPv6 record there.

Then it looks like someone asks for the A record again, and now it has that cache entry; and then there are lots of _http._tcp SRV lookups that are all answered with NXDOMAIN.

And then I see that the actual response to the launchpad.net A query initially contains the A record.  But it is not present when it is inserted into the cache.  The scrubber did not state that it removed it.  So why is that gone?

Trying to reproduce that, I see I need to fix a flaw in qname minimisation and its resumption after a nameserver address fetch.

Best regards, Wouter
Comment 16 Wouter Wijngaards 2019-02-08 13:45:30 CET
Hi Michael,

And in the last message I wrote that the NXDOMAIN is wrong, but that was a bad assumption on my part; it was the flaw later on.

Best regards, Wouter
Comment 17 Michael Marley 2019-02-08 13:48:08 CET
Thanks for investigating!  Is there anything else you need from me or are you good for now?
Comment 18 Wouter Wijngaards 2019-02-08 14:34:53 CET
Hi Michael,

So I fixed a bug in qname minimisation; it is always good to fix bugs, but it may not actually be the problem you have here.  The code change is here just in case.

Index: iterator/iterator.c
===================================================================
--- iterator/iterator.c	(revision 5105)
+++ iterator/iterator.c	(working copy)
@@ -2105,6 +2105,8 @@
 	struct delegpt_addr* target;
 	struct outbound_entry* outq;
 	int auth_fallback = 0;
+	uint8_t* qout_orig = NULL;
+	size_t qout_orig_len = 0;
 
 	/* NOTE: a request will encounter this state for each target it 
 	 * needs to send a query to. That is, at least one per referral, 
@@ -2178,6 +2180,8 @@
 		int labdiff = qchaselabs -
 			dname_count_labels(iq->qinfo_out.qname);
 
+		qout_orig = iq->qinfo_out.qname;
+		qout_orig_len = iq->qinfo_out.qname_len;
 		iq->qinfo_out.qname = iq->qchase.qname;
 		iq->qinfo_out.qname_len = iq->qchase.qname_len;
 		iq->minimise_count++;
@@ -2330,6 +2334,13 @@
 			/* wait to get all targets, we want to try em */
 			verbose(VERB_ALGO, "wait for all targets for fallback");
 			qstate->ext_state[id] = module_wait_reply;
+			/* undo qname minimise step because we'll get back here
+			 * to do it again */
+			if(qout_orig && iq->minimise_count > 0) {
+				iq->minimise_count--;
+				iq->qinfo_out.qname = qout_orig;
+				iq->qinfo_out.qname_len = qout_orig_len;
+			}
 			return 0;
 		}
 		/* did we do enough fallback queries already? */
@@ -2463,6 +2474,13 @@
 				iq->num_current_queries);
 			qstate->ext_state[id] = module_wait_reply;
 		}
+		/* undo qname minimise step because we'll get back here
+		 * to do it again */
+		if(qout_orig && iq->minimise_count > 0) {
+			iq->minimise_count--;
+			iq->qinfo_out.qname = qout_orig;
+			iq->qinfo_out.qname_len = qout_orig_len;
+		}
 		return 0;
 	}
 


Because now that I can make similar queries, I see that the A record is actually there and it looks all right to me.  Let me backtrack to where the problem seems to be; what is going wrong again?

Best regards, Wouter
Comment 19 Wouter Wijngaards 2019-02-08 14:43:11 CET
Hi Michael,

From what I can see, the lookup of ppa.launchpad.net succeeds, also in your trace.  What exactly seemed to be the problem?  There is an A lookup that happens during the qname minimisation.  It looks like it had the right answer and is later reused to answer things.  That does not happen when you disable qname minimisation; then it just queries for that name itself.  But it looks like, even though that is 'unusual', it does not actually go wrong, judging from the trace, and it also does not for me.

What I fixed is that qname minimisation skips a label when a nameserver fetch occurs.

The NXDOMAINs seem to be fine; it is just another query that happens to be NXDOMAIN, and this is also looked up.

So I must be confused: you have high-verbosity logs of what is going wrong, but I am looking in the wrong place.  What is the wrong output?

Best regards, Wouter
Comment 20 Wouter Wijngaards 2019-02-08 14:59:04 CET
Hi Michael,

This is the cache lookup result from the 'log output with verbosity 5' logs,
Feb  8 06:47:21 michaelmarley unbound: [14301:1] info: msg from cache lookup ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0#012;; flags: qr ra ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0 #012;; QUESTION SECTION:#012ppa.launchpad.net.#011IN#011A#012#012;; ANSWER SECTION:#012ppa.launchpad.net.#011600#011IN#011A#01191.189.95.83#012#012;; AUTHORITY SECTION:#012#012;; ADDITIONAL SECTION:#012;; MSG SIZE  rcvd: 51

And it looks fine: ppa.launchpad.net A 91.189.95.83, and that gets returned to the client.  I don't really see what went wrong.

Best regards, Wouter
Comment 21 Michael Marley 2019-02-08 15:18:52 CET
I think you said it best yourself with "But without an address you could not connect to it." in comment 15.  It seems the response getting sent back to the client doesn't have an address in it, which is causing APT to fail with "W: Failed to fetch…Could not resolve 'ppa.launchpad.net'".

The really weird part of this is that it only happens when the message cache is enabled and DNS-over-TLS is in use.  If either of those conditions is not true or if I just do a single lookup for ppa.launchpad.net instead of doing the "sudo apt update", it works properly.  I also checked again and reconfirmed that this does not happen with 1.8.x even with everything else configured in exactly the same way.
Comment 22 Wouter Wijngaards 2019-02-08 15:50:40 CET
Hi Michael,

Yes, I managed to replicate this (just once, and I cannot seem to get it again): it echoes the query back over TCP or TLS, and that looks like an empty reply with no data and without the QR flag, as you said.  Other cases don't have this, and it somehow depends on the order of TCP events, I think.  I am trying to reproduce it more.

Best regards, Wouter
Comment 23 Wouter Wijngaards 2019-02-08 16:07:03 CET
Hi Michael,

Found it and fixed it.  You were right, it was DNS-over-TLS, and also plain TCP.  What happens is that the client sends multiple queries, and the qname minimisation makes a state-machine pattern where the query was erroneously echoed back to the client instead of the buffer that contained the answer.  Fixed that; the code is in the code repository and below:

Index: services/listen_dnsport.c
===================================================================
--- services/listen_dnsport.c	(revision 5105)
+++ services/listen_dnsport.c	(working copy)
@@ -1779,6 +1779,12 @@
 	/* If mesh failed(mallocfail) and called commpoint_send_reply with
 	 * something like servfail then we pick up that reply below. */
 	if(req->is_reply) {
+		/* reply from mesh is in the spool_buffer */
+		sldns_buffer_clear(c->buffer);
+		sldns_buffer_write(c->buffer,
+			sldns_buffer_begin(req->spool_buffer),
+			sldns_buffer_limit(req->spool_buffer));
+		sldns_buffer_flip(c->buffer);
 		goto send_it;
 	}
 

Best regards, Wouter
Comment 24 Michael Marley 2019-02-08 16:08:29 CET
Awesome, thanks!  I will recompile with this patch and retest shortly.
Comment 25 Michael Marley 2019-02-08 16:56:17 CET
I can confirm that the patch fixes the bug, thanks!