Bugzilla – Bug 1189
nsd queries over tcp regularly get tcp connection reset or end of file responses
Last modified: 2017-01-04 18:53:08 CET
Created attachment 371
debug with connection reset, we also get "end of file" on other requests but this is not in this log
We tested nsd-4.1.12-1.el6.x86_64 on a 2.6.32-642.4.2.el6.x86_64 kernel (CentOS 6.8)
However, if we query the SOA of a configured zone over tcp once per second, we regularly get an "end of file" or "connection reset" response. We ran this once-per-second query both locally on the machine and remotely, and both yield the same results.
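For anyone who wants to reproduce the once-per-second check without dig, here is a minimal sketch in Python using only the standard library. It is not the tooling used in this report; the server address 127.0.0.1 and the zone name example.com are placeholders, and the outcome labels simply mirror the wording dig uses.

```python
import socket
import struct
import time


def build_soa_query(zone, query_id=0x1234):
    """Build a DNS SOA query for `zone`, framed for TCP (2-byte length prefix)."""
    # Header: ID, flags (plain query), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 6, 1)  # QTYPE=SOA (6), QCLASS=IN (1)
    message = header + question
    return struct.pack("!H", len(message)) + message


def recv_exact(sock, n):
    """Read exactly n bytes, raising EOFError if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("end of file")
        buf += chunk
    return buf


def probe(server, zone, port=53, timeout=5.0):
    """Send one SOA query over TCP and classify the outcome."""
    try:
        with socket.create_connection((server, port), timeout=timeout) as sock:
            sock.sendall(build_soa_query(zone))
            (length,) = struct.unpack("!H", recv_exact(sock, 2))
            recv_exact(sock, length)
            return "ok"
    except EOFError:
        return "end of file"
    except ConnectionResetError:
        return "connection reset"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Placeholder server and zone; replace with your own.
    while True:
        result = probe("127.0.0.1", "example.com")
        if result != "ok":
            print(time.strftime("%H:%M:%S"), result)
        time.sleep(1)
```

Printing only the failures with a timestamp makes it easy to line them up against the nsd log.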
Nothing abnormal was logged with the verbose: 2 setting.
This happens roughly every 10 minutes, but the interval can be shorter or longer.
RRL rate limiting was disabled.
We also monitored system calls to the kernel, and at the moment we get a "connection reset", the kernel also reports this as nsd sending a tcp_send_active_reset.
When we get an "end of file" response from dig, the kernel does not log it as a tcp reset.
Any idea what is going on?
See the attachment for the debug output.
After some further troubleshooting, we seem to have isolated the root cause of this behavior.
We have another zone configured in the same nsd instance that is regularly notified for IXFR, roughly once a minute.
The failing tcp queries as described below always coincide exactly with the IXFR being pulled over tcp and the NSD internals reloading afterwards.
If we disable all dynamic updates, all our normal tcp queries succeed.
This seems like a bug. Can this be looked into?
Thank you Sven,
When an NSD server reloads, it forks off new server processes with the new database to serve. The old server processes, still containing the old database, are then killed by sending them an NSD_QUIT signal. A server process will quit immediately when it receives this signal.
Unlike UDP, TCP requests can take more than a single processing iteration in a server process to complete. A server process might still have unfinished TCP sessions when it receives an NSD_QUIT signal because of a reload. Those TCP sessions will then close because the process is terminating.
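To illustrate what the client sees in that situation, here is a small stand-alone Python sketch. It is not NSD code and does not model the reload itself; it only shows, on a loopback socket, how a server side that goes away after reading a query but before answering appears to the client. An orderly close yields "end of file"; an abortive close yields "connection reset". The abortive close is forced here with SO_LINGER and a zero timeout, which is an assumption made for the demonstration, and the "ii" packing assumes the Linux/Unix struct linger layout.

```python
import socket
import struct
import threading


def serve_once(listener, abort):
    """Accept one connection, read the query, then close without answering."""
    conn, _ = listener.accept()
    conn.recv(512)  # consume the query so no unread data is left behind
    if abort:
        # SO_LINGER with a zero timeout makes close() send RST instead of FIN.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
    conn.close()


def observe(abort):
    """Return what a client sees when the server side goes away mid-session."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # let the kernel pick a free port
    listener.listen(1)
    worker = threading.Thread(target=serve_once, args=(listener, abort))
    worker.start()
    client = socket.create_connection(listener.getsockname(), timeout=5)
    try:
        client.sendall(b"\x00\x01\x00")  # placeholder bytes, not a real query
        worker.join()                    # wait until the server has closed
        data = client.recv(2)            # try to read the length prefix
        return "end of file" if data == b"" else "data"
    except ConnectionResetError:
        return "connection reset"
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(observe(abort=False))
    print(observe(abort=True))
```

On Linux, closing a socket that still has unread data in its receive queue also results in a reset rather than a normal close, which may be one reason both symptoms show up.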
We need to consider how to handle finishing TCP sessions within NSD's forking server processes design, so this issue will not be resolved on very short notice. Thanks a lot for bringing it to our attention though, because with the increasing interest in and demand for TCP with DNS, this is really something that needs to be addressed.
Does this issue cause operational problems for you?
Thanks for your reply. Makes things clear.
It isn't causing major operational issues for us now, but given the increasing attention to TCP, we are following this up more closely. We did not see this behavior in, for example, our bind test servers, so we thought it was worth investigating where the root cause came from.
Do you have a rough ETA for when this can be fixed? For now, may I suggest adding this to the documentation?
We have scheduled the design considerations for January next year. Estimating roughly one to one and a half months of effort, I expect this to be resolved in February.
(In reply to sven.vandyck from comment #3)
> Thanks for your reply. Makes things clear.
> It isn't causing major operational issues for us now, but with the increased
> attention to TCP in the future we are more closely following this up. We did
> not see this behavior in, for example, our bind test servers. So we thought
> it was worth investigating where the root cause came from.
> Do you have rough ETA when this can be fixed? For now a suggestion is to add
> this to the documentation?
I think a config setting may solve the issue for you: xfrd-reload-timeout is set to 1 (one second), and perhaps it should be set to 10 seconds. The frequent reloads are interrupting the other work. With at most one reload every 10 seconds, the issue would occur a lot less often, or perhaps not at all. Perhaps this tweak solves the issue for you?
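As a sketch of that change, assuming the option lives in the server: section of your nsd.conf alongside your existing settings:

```
server:
    # Wait at least 10 seconds between reloads triggered by zone transfers.
    xfrd-reload-timeout: 10
```

This only throttles how often reloads happen; a TCP session that is still open when a reload does occur can still be cut off.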
Best regards, Wouter
We will try this setting.