Bug 1189 - nsd queries over tcp regularly get tcp connection reset or end of file responses
Status: ASSIGNED
Product: NSD
Classification: Unclassified
Component: NSD Code
Version: 4.1.x
Hardware: x86_64 Linux
Importance: P5 minor
Assigned To: NSD team
Depends on:
Blocks:
 
Reported: 2016-12-16 18:56 CET by sven.vandyck
Modified: 2017-01-04 18:53 CET (History)

Attachments
debug with connection reset, we also get "end of file" on other requests but this is not in this log (3.71 KB, text/plain)
2016-12-16 18:56 CET, sven.vandyck

Description sven.vandyck 2016-12-16 18:56:09 CET
Created attachment 371
debug with connection reset, we also get "end of file" on other requests but this is not in this log

Dear,

We tested nsd-4.1.12-1.el6.x86_64 on a 2.6.32-642.4.2.el6.x86_64 kernel (CentOS 6.8).

However, when we query the SOA of a configured zone over TCP once per second, we regularly get an "end of file" or "connection reset" response. We ran this once-per-second query both locally on the machine and remotely, and both yield the same results.
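
For reference, the once-per-second probe described above can be approximated with a short Python sketch. This is not the tool used in this report (the "end of file" wording comes from dig); the function names, the server address, the port, and the zone name are all placeholders chosen for illustration.

```python
import socket
import struct
import time


def build_soa_query(zone, query_id=0x1234):
    """Build a minimal DNS query message for the SOA record of `zone`."""
    # Header: ID, flags (plain standard query), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    labels = [part for part in zone.rstrip(".").split(".") if part]
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 6, 1)  # QTYPE=SOA, QCLASS=IN


def recv_exact(sock, count):
    """Read exactly `count` bytes, raising EOFError if the peer closes early."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise EOFError("peer closed the connection")
        data += chunk
    return data


def query_once(server, port, zone, timeout=5.0):
    """Send one SOA query over TCP and classify the outcome as a string."""
    message = build_soa_query(zone)
    try:
        with socket.create_connection((server, port), timeout=timeout) as sock:
            # DNS over TCP prefixes each message with a 2-byte length.
            sock.sendall(struct.pack("!H", len(message)) + message)
            (length,) = struct.unpack("!H", recv_exact(sock, 2))
            recv_exact(sock, length)
        return "ok"
    except EOFError:
        return "end of file"
    except ConnectionResetError:
        return "connection reset"
    except OSError as exc:
        return "error: %s" % exc


def run_probe(server, port, zone, count, interval=1.0):
    """Query `count` times, `interval` seconds apart; return the failures."""
    failures = []
    for _ in range(count):
        result = query_once(server, port, zone)
        if result != "ok":
            failures.append((time.strftime("%Y-%m-%d %H:%M:%S"), result))
        time.sleep(interval)
    return failures
```

Calling, for example, `run_probe("127.0.0.1", 53, "example.com", count=600)` would cover roughly ten minutes and return a timestamped list of the failed queries, which can then be compared against the times of other events on the server.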

Nothing abnormal was logged with the verbose: 2 setting.

This happens roughly every 10 minutes, though the interval varies.

RRL rate limiting was disabled.

We also monitored system calls to the kernel: at the moment we get a "connection reset", the kernel also reports nsd sending a tcp_send_active_reset.

When we get an "end of file" response from dig, the kernel does not log a TCP reset.

Any idea what is going on?

Thanks

See attachment for debug
Comment 1 sven.vandyck 2016-12-19 12:51:14 CET
Hi,

After some further troubleshooting, we seem to have isolated the root cause of this behavior.

We have another zone configured in the same nsd instance that regularly gets notified for IXFR, roughly once a minute.

The failing TCP queries described above always coincide exactly with NSD pulling the IXFR over TCP and reloading its internals afterwards.

When we disabled all dynamic updates, all our normal TCP queries succeeded.

This seems like a bug? Can this be looked into?

Thanks
Comment 2 Willem Toorop 2016-12-19 16:01:05 CET
Thank you Sven,

When an NSD server reloads, it forks off new server processes with the new database to serve.  The old server processes, still containing the old database, are then killed by sending them an NSD_QUIT signal.  A server process will quit immediately when it receives this signal.

Unlike UDP, TCP requests can take more than a single processing iteration in a server process to complete.  A server process might still have unfinished TCP sessions when it receives a NSD_QUIT signal because of a reload.  Those TCP sessions will then close because the process is terminating.
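
The two symptoms seen by the client are consistent with this. The following Python sketch is not NSD code; it is a stand-alone loopback demonstration, with a hypothetical function name, of what a TCP client observes when the server side closes a connection without replying. It assumes typical Linux TCP behaviour: if the server has already read the request, the close is an orderly FIN and the client sees end of file; if the request is still unread in the server's receive buffer, the kernel sends a reset instead.

```python
import socket
import threading
import time


def observe_close(server_reads_request):
    """Return what a client sees when the server closes without replying."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    request = b"\x00\x04test"  # stand-in payload, not a real DNS message

    def server():
        conn, _ = listener.accept()
        if server_reads_request:
            # Consume the whole request, then quit before answering.
            received = b""
            while len(received) < len(request):
                chunk = conn.recv(1024)
                if not chunk:
                    break
                received += chunk
        else:
            # Give the request time to arrive, but never read it.
            time.sleep(0.2)
        conn.close()

    thread = threading.Thread(target=server, daemon=True)
    thread.start()
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(request)
            try:
                data = client.recv(1024)
            except ConnectionResetError:
                return "connection reset"
            return "end of file" if data == b"" else "data"
    finally:
        thread.join(timeout=5)
        listener.close()
```

Here `observe_close(True)` models a process that quits between reading a query and writing the answer, while `observe_close(False)` models one that quits with a query still queued on the socket.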

We need to consider how to handle finishing TCP sessions within NSD's forking server process design, so this issue will not be resolved on very short notice.  Thanks a lot for bringing it to our attention though, because with the increasing interest in and demand for TCP with DNS, this is really something that needs to be addressed.

Does this issue cause operational problems for you?

-- Willem
Comment 3 sven.vandyck 2016-12-19 16:25:16 CET
Hi,

Thanks for your reply. Makes things clear.

It isn't causing major operational issues for us now, but given the growing attention to TCP we are following this up more closely. We did not see this behavior on, for example, our BIND test servers, so we thought it was worth investigating the root cause.

Do you have a rough ETA for a fix? In the meantime, could this be added to the documentation?

Thanks
Comment 4 Willem Toorop 2016-12-19 16:53:37 CET
We have scheduled the design work for January next year. Estimating roughly one to one and a half months of effort, I expect this to be resolved in February.


Comment 5 Wouter Wijngaards 2017-01-02 13:30:03 CET
Hi Sven,

I think a config setting may solve the issue for you: xfrd-reload-timeout: 1 is set at one second, and perhaps it should be set at 10 seconds.  The frequent reloads are interrupting the other work.  With reloads at most every 10 seconds, the issue would occur far less often, or perhaps not at all.  Does this tweak solve the issue for you?
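
As a sketch of that suggestion, the option belongs in the server: section of nsd.conf. The surrounding layout here is an assumption about the local configuration, and 10 is only the value proposed above, not a tested recommendation.

```
server:
    # Wait at least 10 seconds after a transfer-triggered reload
    # before triggering the next one (the value used so far is 1).
    xfrd-reload-timeout: 10
```

NSD has to be restarted or told to re-read its configuration for the change to take effect.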

Best regards, Wouter
Comment 6 sven.vandyck 2017-01-04 18:53:08 CET
Thanks,

We will try this setting.