Bug 752 - cmd channel thread fails, creating zombie
cmd channel thread fails, creating zombie
Product: unbound
Classification: Unclassified
Component: server
x86_64 OpenBSD
: P5 critical
Assigned To: unbound team
Depends on:
  Show dependency treegraph
Reported: 2016-04-10 12:30 CEST by jduricic
Modified: 2017-10-31 11:23 CET (History)
4 users (show)

See Also:


Note You need to log in before you can comment on or make changes to this bug.
Description jduricic 2016-04-10 12:30:50 CEST
Every couple of days of querying for statistics with "/usr/sbin/unbound-control stats" [query done every minute] command channel thread fails - leaving thread hanging. Process cant be gracefully shut, and kill -9 is required to kill the thread directly.

This is /var/log/messages output:
unbound: [24936:2] error: tube msg write failed: Resource temporarily unavailable
unbound: [24936:2] fatal error: could not write stat values over cmd channel

ps axw output:
24936 ??  Z       0:00.00 (unbound)
14477 ??  Is      1:05.08 unbound -c /var/unbound/etc/unbound.conf
 7353 ??  S       0:12.23 unbound -c /var/unbound/etc/unbound.conf
31491 ??  S       0:10.07 unbound -c /var/unbound/etc/unbound.conf

as you can see from ps output there is now a zombie thread.
Comment 1 Wouter Wijngaards 2016-04-11 11:22:31 CEST
Hi jduricic,

The failure to write on a pipe with resource unavailable, means that the OpenBSD kernel is out of buffers.

You should increase that buffer space (perhaps with sysconf, rc.conf or recompile, I don't know), or perhaps use less network-socket-buffers by having less streams active, I guess.

Best regards, Wouter
Comment 2 jduricic 2016-04-27 07:31:17 CEST

sorry for my late reply, testing and observing, so kernbuffers is set to 128000, so I cant see that as a problem.

Also as a side note it happens also on servers without any load on them so that is a bit weird as well.


Comment 3 jduricic 2016-04-27 08:17:39 CEST
kernbuffers I meant maxcluster:

#netstat -m output
77 mbufs in use:
        66 mbufs allocated to data
        5 mbufs allocated to packet headers
        6 mbufs allocated to socket names and addresses
61/336/128000 mbuf 2048 byte clusters in use (current/peak/max)
0/8/128000 mbuf 4096 byte clusters in use (current/peak/max)
0/8/128000 mbuf 8192 byte clusters in use (current/peak/max)
0/14/128002 mbuf 9216 byte clusters in use (current/peak/max)
0/10/128000 mbuf 12288 byte clusters in use (current/peak/max)
0/8/128000 mbuf 16384 byte clusters in use (current/peak/max)
0/8/128000 mbuf 65536 byte clusters in use (current/peak/max)
1180 Kbytes allocated to network (11% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
Comment 4 Wouter Wijngaards 2016-04-28 09:13:04 CEST

It happens for servers without any load?  With resource unavailable?

Why does the kernel set errno to resource unavailable for a write call to a pipe for a computer without load?

Best regards, Wouter
Comment 5 giannici 2016-06-02 11:19:12 CEST
We too experience this exact problem.

We issue the "unbound-control stats_noreset" command every 5 minutes, and after a few days or weeks the unbound-control stops working and one of the unbound processes is a zombie.

Maybe there is some kind or memory/socket leak in the stats code?
Comment 6 Wouter Wijngaards 2016-06-02 11:36:24 CEST

Perhaps this will fix it, it retries, since the error is called 'temporarily unavailable'.  Fix is also in source code repository.

Index: util/tube.c
--- util/tube.c	(revision 3743)
+++ util/tube.c	(working copy)
@@ -304,6 +304,8 @@
 	d = r;
 	while(d != (ssize_t)sizeof(len)) {
 		if((r=write(fd, ((char*)&len)+d, sizeof(len)-d)) == -1) {
+			if(errno == EAGAIN)
+				continue; /* temporarily unavail: try again*/
 			log_err("tube msg write failed: %s", strerror(errno));
 			return 0;
@@ -313,6 +315,8 @@
 	d = 0;
 	while(d != (ssize_t)len) {
 		if((r=write(fd, buf+d, len-d)) == -1) {
+			if(errno == EAGAIN)
+				continue; /* temporarily unavail: try again*/
 			log_err("tube msg write failed: %s", strerror(errno));
 			return 0;
Comment 7 jduricic 2017-10-31 11:23:37 CET
Confirmed as fixed.