Bug 596 - stuck SSL connection from unbound-control blocks unbound worker threads
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
Version: 1.4.22
Hardware: i386 Linux
Importance: P5 major
Assigned To: unbound team
Reported: 2014-07-15 16:33 CEST by Carsten Strotmann
Modified: 2014-07-16 12:08 CEST (History)
Description Carsten Strotmann 2014-07-15 16:33:59 CEST
This issue appeared during a DDoS attack on an unbound server (at an ISP)
that exhausted TCP resources in the Linux kernel (SYN buffers).

During this incident, the operator tried to gather information from
the system using "unbound-control". Because of the TCP resource
exhaustion, the SSL connection between unbound and unbound-control
went stale and the unbound thread was stuck in a retry loop writing
data to the SSL connection.

The Unbound log showed hundreds of errors of the form:

Jul 15 10:56:32 unbound[32725:0] error: could not SSL_write crypto error:00000000:lib(0):func(0):reason(0)
Jul 15 10:56:32 unbound[32725:0] error: could not SSL_write crypto error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
Jul 15 10:56:32 unbound[32725:0] error: could not SSL_write crypto error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry 
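For readers who want to take the error code in the log apart: a minimal sketch, assuming the packed error layout used by OpenSSL 1.0.x-era libraries (8 bits library, 12 bits function, 12 bits reason). The helper name decode_openssl_error is made up for this illustration and is not part of unbound or OpenSSL.

```python
def decode_openssl_error(code_hex):
    """Split a packed error code into (lib, func, reason).

    Assumes the OpenSSL 1.0.x-era layout: bits 24-31 hold the library,
    bits 12-23 the function, bits 0-11 the reason.
    """
    code = int(code_hex, 16)
    lib = (code >> 24) & 0xFF
    func = (code >> 12) & 0xFFF
    reason = code & 0xFFF
    return lib, func, reason


# The code from the log lines above.
lib, func, reason = decode_openssl_error("1409F07F")
print(lib, func, reason)  # prints: 20 159 127
```

Library 20 is the SSL library, which matches the "SSL routines" text in the log line. The all-zero code in the first log line decodes to (0, 0, 0), meaning no error was queued in the library at that point.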

It seems there is no limit to the retries. While the unbound thread
was writing to the SSL connection, it was not working on queries. Because
multiple unbound-control instances were started, after some time all
threads were blocked and all queries to this Unbound DNS resolver were
timing out.

The solution was to kill (hard kill with SIGKILL) the "hanging"
unbound-control processes. Killing the processes unblocked the
Unbound-threads and Unbound continued to work on queries.

The SSL connection between unbound and unbound-control should have a
retry limit when writing data to the connection. If writing fails for too
long, both sides should recover gracefully and continue.
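
The retry limit requested above could look roughly like the following. This is only a sketch of the idea, not unbound's code: write_with_deadline, WriteWouldBlock and the write_once callable are hypothetical names, and the timeout values are arbitrary.

```python
import time


class WriteWouldBlock(Exception):
    """Hypothetical signal: the transport cannot accept data right now."""


def write_with_deadline(write_once, data, timeout_s=5.0, pause_s=0.01,
                        clock=time.monotonic, sleep=time.sleep):
    """Write all of data, but give up once timeout_s has elapsed.

    write_once(chunk) is a hypothetical callable that returns the number
    of bytes it accepted, or raises WriteWouldBlock if it accepted none.
    Returns True when everything was written, False on timeout.
    """
    deadline = clock() + timeout_s
    remaining = bytes(data)
    while remaining:
        try:
            sent = write_once(remaining)
            remaining = remaining[sent:]
        except WriteWouldBlock:
            if clock() >= deadline:
                return False  # caller should close the connection
            sleep(pause_s)
    return True
```

On a False return the caller would drop the control connection, so the thread goes back to answering queries instead of retrying forever.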
Comment 1 Wouter Wijngaards 2014-07-15 16:37:56 CEST
Hi Carsten,

In unbound, at this point, it would notice the ssl write error and stop execution of the remote-control loop.  Can you tell me which commands the operator tried to run?  It should print this error maybe once or twice but not forever.

Best regards,
   Wouter
Comment 2 Wouter Wijngaards 2014-07-15 16:41:20 CEST
Hi Carsten,

Did he run the unbound-control command 'list_local_zones' by chance, and does he have hundreds of those zones?  From reading the code, that looks like a plausible cause of the failure, and there is a plausible fix for it too.

Best regards,
   Wouter
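
One way to picture the failure suspected in this comment: a command that writes one line per zone and ignores a failed write would log one error for every remaining zone. The sketch below is an illustration of that pattern only, not unbound's code; send_all_lines, make_failing_sender and the zone names are hypothetical.

```python
def send_all_lines(lines, send_line):
    """Send each line and stop at the first failed write.

    send_line(line) is a hypothetical callable returning True on success
    and False on failure. Returns the number of lines sent successfully.
    """
    sent = 0
    for line in lines:
        if not send_line(line):
            break  # connection is broken; do not try the remaining lines
        sent += 1
    return sent


def make_failing_sender(ok_count):
    """Build a fake sender that succeeds ok_count times, then always fails."""
    attempts = []

    def send_line(line):
        attempts.append(line)
        return len(attempts) <= ok_count

    return send_line, attempts


zones = ["zone%d.example." % i for i in range(300)]
send_line, attempts = make_failing_sender(3)
print(send_all_lines(zones, send_line), len(attempts))  # prints: 3 4
```

With the early break, 300 zones produce a single failed write. Without it, the loop would attempt (and log) 297 failing writes on the same dead connection.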
Comment 3 Carsten Strotmann 2014-07-15 18:20:44 CEST
Hi Wouter,

I cannot reconstruct which unbound-control commands were running; I remember that at least "dump_requestlist" was among them. I'll ask about 'list_local_zones'.

I had strace attached to the unbound process in the main terminal (which held the earlier output of all unbound-control commands), and strace was showing the writes to the SSL connection.

Once I killed the stuck unbound-control, Unbound started to work as normal, and strace logged tons of output that wiped my history buffer :( (Next time I will run a "script" session).

-- Carsten
Comment 4 Carsten Strotmann 2014-07-15 18:25:45 CEST
Hi Wouter,

The command that created the "hang" was:

sudo /usr/local/sbin/unbound-control dump_infra
Comment 5 Carsten Strotmann 2014-07-16 11:22:34 CEST
(In reply to comment #4)
> Hi Wouter,
> 
> the command that created the "hang" was 
> 
> sudo /usr/local/sbin/unbound-control dump_infra

The command "unbound-control dump_infra" was called and terminated by the user with CTRL+C before all data was received. 

The command after stopping "unbound-control dump_infra" was

sudo  /usr/local/sbin/unbound-control dump_requestlist 

and that command never returned.
Comment 6 Wouter Wijngaards 2014-07-16 12:08:59 CEST
Hi Carsten,

Thanks, I also fixed the dump_infra command.  It was less obviously broken with respect to the ssl_write failures.  The dump_requestlist code looks fine; unbound was probably too busy with the ssl_write failures at that time.

That should fix this bug!

Best regards, Wouter