Bug 761 - DNSSEC LAME false positive resolving nic.club
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
Version: 1.5.8
Hardware: x86_64 Linux
Importance: P5 normal
Assigned To: unbound team
Reported: 2016-04-28 16:28 CEST by Charles Walker
Modified: 2016-05-18 16:08 CEST

Description Charles Walker 2016-04-28 16:28:47 CEST
I am seeing a behavior of unbound DNSSEC validation which does not seem right.  I first saw this using unbound 1.5.6 with EDNS client subnet, and have replicated it today using 1.5.8 without EDNS client subnet.

I first noticed the problem when unbound sometimes gave no response to this query (issued against an unbound instance listening on the loopback interface on localhost):

dig @localhost nic.club A +edns=0 +dnssec

When I set logging verbosity to 2, I saw that unbound is logging "query response was DNSSEC LAME" for the authoritative name servers for club.  I believe it eventually runs out of name servers and simply does not respond.

Then I set logging verbosity to 2 on an instance of unbound on which we WERE getting a response to the dig command above.  Even in that case, unbound is cycling through the authoritative name servers for club, until it finds one for which it logs "query response was ANSWER".  I see this in version 1.5.8 as I mentioned in the first paragraph above.

I see this behavior referred to on this page:  https://www.unbound.net/documentation/requirements.html

The snippet below from this page describes this behavior that I am seeing in unbound.

"The following issue needs to be resolved:

a server that serves both a parent and child zone, where parent is signed, but child is not. The server must not be marked lame for the parent zone, because the child answer is not signed."

This behavior of unbound causes poor performance even when unbound gives a response.
Comment 1 Wouter Wijngaards 2016-04-28 16:34:08 CEST
Hi Charles,

They should sign nic.club.  Or have fewer servers to iterate over.  Or just accept the slow lookup.  With a 5 min TTL they don't expect caching.

Or host nic.club on a different set of servers.

The decision was to put pressure on nonsigned domains, in favor of DNSSEC support.

Best regards, Wouter
Comment 2 Wouter Wijngaards 2016-04-28 16:56:17 CEST
As another point, the negative cache element that makes nic.club insecure has a lifetime of 24h, so these slow lookups are infrequent after the first one.

Also, if you have prefetch turned on and they increase the TTL sufficiently, the records can be prefetched.  The slowness then does not matter, because the answer simply stays in the cache and the slow lookup is covered by the cache prefetch.

I really do not want to remove the search behaviour that unbound has for the best possible answer.  That includes the search for DNSSEC-signed answers, which is really important to make DNSSEC deployment easier.

Best regards, Wouter
Comment 3 Charles Walker 2016-05-04 05:23:15 CEST
Hi Wouter,

Like I mentioned, we have seen cases in which unbound does not give any answer to queries for nic.club A.  The behavior of iterating over all of the name servers is something that I noticed when I was investigating unbound not responding.

I turned up logging on unbound and looked at the two different cases (unbound responding, unbound not responding).  Unfortunately, unbound not responding is not easy to reproduce.  However I did get it to happen once with verbosity set to 2.  I could not set verbosity any higher because it was a production system.

I still believe that there is a bug in the case in which unbound does not respond.  For the case of no response, looking at the logs with verbosity set to 2, I saw that while unbound is cycling through name servers looking for a server that does not give a lame answer for nic.club A, it is also cycling through the same set of name servers, alternately trying to resolve A and AAAA queries for one of the name server names (ns4.dns.nic.club. in the case I saw).

I think that the bug is that it does not break out of the loop and keeps cycling through the name servers - trying to resolve nic.club A, ns4.dns.nic.club A, and ns4.dns.nic.club AAAA - and it seems not to hit the case which causes the exit from looping through the name servers in the "working" case.  I saw this keep up for 6 seconds until it stopped.  I assume that it got jostled out after that.

Please let me know if you want more info.

Thanks,
Charles
Comment 4 Charles Walker 2016-05-05 16:47:18 CEST
Any more thoughts on comment 3?  I wish I could reproduce the problem that I described in a development environment so that I could really see what's happening.  I suspect that something sets iq->dnssec_lame_query back to 0 in the case I described in comment 3 but I haven't been able to find it yet.  Maybe the call to iter_server_selection in processQueryTargets?
Comment 5 Wouter Wijngaards 2016-05-18 16:08:13 CEST
Hi Charles,

After an explanation from Glen, I understand what sort of solution is appropriate.

I think the solution below (also in svn trunk, and in upcoming releases) solves the problem and does not require build time tweaks or config options.

It performs at most 3 retries because of DNSSEC lameness, rather than the 'infinite' retries that were causing the looping issues.

Best regards, Wouter

Index: iterator/iterator.c
===================================================================
--- iterator/iterator.c	(revision 3719)
+++ iterator/iterator.c	(working copy)
@@ -2174,6 +2174,7 @@
 	}
 	if(iq->dnssec_expected && !iq->dnssec_lame_query &&
 		!(iq->chase_flags&BIT_RD) 
+		&& iq->sent_count < DNSSEC_LAME_DETECT_COUNT
 		&& type != RESPONSE_TYPE_LAME 
 		&& type != RESPONSE_TYPE_REC_LAME 
 		&& type != RESPONSE_TYPE_THROWAWAY 
Index: iterator/iterator.h
===================================================================
--- iterator/iterator.h	(revision 3719)
+++ iterator/iterator.h	(working copy)
@@ -61,6 +61,9 @@
 #define MAX_REFERRAL_COUNT	130
 /** max number of queries-sent-out.  Make sure large NS set does not loop */
 #define MAX_SENT_COUNT		32
+/** max number of queries for which to perform dnsseclameness detection,
+ * (rrsigs misssing detection) after that, just pick up that response */
+#define DNSSEC_LAME_DETECT_COUNT 4
 /**
  * max number of QNAME minimisation iterations. Limits number of queries for
  * QNAMEs with a lot of labels.
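
As a rough illustration of what the added condition does, here is a minimal standalone sketch in C. It is not unbound code: the struct and the toy_* functions are hypothetical stand-ins, and the model assumes that the sent counter is incremented before each query goes out and that every server returns an unsigned response. Only the name DNSSEC_LAME_DETECT_COUNT and its value 4 come from the patch above.

```c
#include <assert.h>
#include <stdbool.h>

/* Value taken from the patch to iterator/iterator.h. */
#define DNSSEC_LAME_DETECT_COUNT 4

/* Hypothetical, simplified stand-in for the iterator query state. */
struct toy_query {
    int sent_count;        /* queries sent so far for this lookup */
    bool dnssec_expected;  /* parent zone is signed, so RRSIGs are expected */
};

/*
 * Simplified form of the patched condition: an unsigned response is treated
 * as DNSSEC lame (and retried elsewhere) only while sent_count is below the
 * detection limit. The real condition checks more fields than this.
 */
static bool toy_treat_as_dnssec_lame(const struct toy_query *q, bool has_rrsigs)
{
    return q->dnssec_expected
        && q->sent_count < DNSSEC_LAME_DETECT_COUNT
        && !has_rrsigs;
}

/*
 * Toy model of a lookup in which no server ever returns signatures.
 * Returns the number of queries sent before a response is accepted,
 * or -1 if max_sent queries were used up first.
 */
static int toy_resolve_unsigned(int max_sent)
{
    struct toy_query q = { 0, true };

    assert(max_sent > 0);
    while (q.sent_count < max_sent) {
        q.sent_count++;  /* assumption: counted before the query is sent */
        if (!toy_treat_as_dnssec_lame(&q, false))
            return q.sent_count;  /* limit reached: accept unsigned answer */
    }
    return -1;
}
```

Under these assumptions, toy_resolve_unsigned(32) accepts the fourth response: the first query plus three retries caused by DNSSEC lameness, which matches the "max 3 retries" described in comment 5. Without the sent_count check, the same loop would only stop when it ran into the overall query limit.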