Bug 307 - unbound does not respond if answer is not in cache while 0x20 is turned on
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
Version: unspecified
Platform: All All
Importance: P2 normal
Assigned To: unbound team
Reported: 2010-04-20 02:57 CEST by Hua Zhang
Modified: 2010-04-26 17:00 CEST
Description Hua Zhang 2010-04-20 02:57:55 CEST
This is triggered by name servers that are both capsfail (they do not echo the 0x20 query case back) and REC_LAME. For example:

 dig +norec @dns.hanqinet.com wWw.forbeScHinamagazine.com

; <<>> DiG 9.5.1-P3 <<>> +norec @dns.hanqinet.com wWw.forbeScHinamagazine.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29887
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.forbeschinamagazine.com.   IN      A

;; ANSWER SECTION:
www.forbeschinamagazine.com. 7201 IN    A       124.172.245.51

;; Query time: 36 msec
;; SERVER: 59.42.254.231#53(59.42.254.231)
;; WHEN: Tue Apr 20 08:06:58 2010
;; MSG SIZE  rcvd: 61
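
In the output above, the query was sent as wWw.forbeScHinamagazine.com but the question section came back in lowercase. A minimal sketch of that check, assuming names are compared as presentation-format strings; the helper names are hypothetical and are not unbound functions:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper, not part of unbound: returns 1 if the name echoed in
 * the reply's question section matches the sent name byte for byte, i.e. the
 * server preserved the 0x20 case pattern. */
int caps_echoed(const char *sent, const char *echoed)
{
    assert(sent != NULL && echoed != NULL);
    return strcmp(sent, echoed) == 0;
}

/* Returns 1 if both strings spell the same name when ASCII case is ignored. */
int same_name_nocase(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (; *a && *b; a++, b++) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}

/* A reply counts as "capsfail" here when it is for the right name but the
 * case pattern of the query was not copied back. */
int is_capsfail(const char *sent, const char *echoed)
{
    return same_name_nocase(sent, echoed) && !caps_echoed(sent, echoed);
}
```

Calling is_capsfail("wWw.forbeScHinamagazine.com.", "www.forbeschinamagazine.com.") returns 1, which corresponds to the reply shown above.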

The problem is that iter_qstate::num_current_queries is decremented to 0 before processQueryResponse() is called. processQueryResponse() then decrements it again, to -1, and processQueryTargets is called again because the name server is REC_LAME. processQueryTargets leaves behind a mesh_state that does nothing. As queries of this kind accumulate, unbound refuses to accept any new query that is not in the cache, because mesh_area::num_reply_addrs exceeds num-queries-per-thread*16.

After I tried to fix this bug by increasing num_current_queries to 1 before calling processQueryResponse(), I found another problem with this kind of query. Unbound returns SERVFAIL for the query because the maximum number of attempts is reached for all usable name servers (the maximum per name server is 5 in unbound, but unbound tries 2 rounds of the caps_fallback algorithm, which sends 2*3*num_of_name_server queries).
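
The arithmetic behind the SERVFAIL can be checked with a short sketch. It assumes the reading that each fallback round sends 3 queries per name server; the constant and function names are hypothetical.

```c
#include <assert.h>

/* Values taken from the report; the names are made up for this sketch. */
enum {
    MAX_ATTEMPTS_PER_SERVER = 5, /* attempt limit per name server */
    FALLBACK_ROUNDS = 2,         /* rounds of the caps_fallback algorithm */
    QUERIES_PER_ROUND = 3        /* assumed queries per server in each round */
};

/* Total queries the fallback wants to send: 2 * 3 * num_servers. */
int fallback_queries_total(int num_servers)
{
    assert(num_servers >= 0);
    return FALLBACK_ROUNDS * QUERIES_PER_ROUND * num_servers;
}

/* Total attempts allowed before every server is considered exhausted. */
int attempt_budget_total(int num_servers)
{
    assert(num_servers >= 0);
    return MAX_ATTEMPTS_PER_SERVER * num_servers;
}

/* Returns 1 if the fallback needs more attempts than the limit allows. */
int fallback_exceeds_budget(int num_servers)
{
    return fallback_queries_total(num_servers) > attempt_budget_total(num_servers);
}
```

With 2 name servers the fallback needs 12 queries but only 10 attempts are allowed, so the limit is hit before the fallback can finish.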

I also found a problem with the caps_fallback algorithm itself (sorry, not related to this bug). Many replies from authoritative servers contain multiple records in the answer section. If the authoritative server is capsfail, unbound queries it multiple times and compares the replies, but the records in the answer section may appear in a different order in each reply. Unbound then fails to answer this type of query while 0x20 is turned on, for example the MX records of forbeschinamagazine.com.
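
One way to make the comparison independent of record order is to sort both sets before comparing them. The sketch below is a simplification that treats each record as a presentation-format rdata string; it is not unbound's implementation, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Returns 1 if both record sets contain the same rdata strings in any order.
 * Both arrays are sorted in place as a side effect. */
int rrsets_equal_unordered(const char **x, size_t nx, const char **y, size_t ny)
{
    size_t i;
    assert(x != NULL && y != NULL);
    if (nx != ny)
        return 0;
    qsort(x, nx, sizeof(*x), cmp_str);
    qsort(y, ny, sizeof(*y), cmp_str);
    for (i = 0; i < nx; i++) {
        if (strcmp(x[i], y[i]) != 0)
            return 0;
    }
    return 1;
}
```

Two MX answers that list the same exchangers in a different order compare as equal with this function, while sets that differ in any record do not. A real resolver would compare wire-format rdata in canonical form rather than strings.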
Comment 1 Wouter Wijngaards 2010-04-20 09:05:49 CEST
Hi Hua Zhang,

Thank you for the bug report.  We need to improve the handling of sites that do not copy the caps back.  The issue with the counter going to -1 during fallback sounds like a serious bug.  I'll look into it.  As for reordering records for the comparison, I am not sure what we can do: there is a limit with Akamai-like systems that give (completely) different replies every time, and those simply cannot be resolved if they do not copy the 0x20 bits back.

Best regards,
   Wouter
Comment 2 Wouter Wijngaards 2010-04-26 17:00:24 CEST
Fixed the -1 and also the max 5 problem you reported.  Also fixed so that when an rrset does not match right away, it tries to canonicalize and sort the rrs before comparison.  It uses the referral_count to avoid looping (max 30).

Thanks for the report,
   Best regards,
      Wouter