Bug 593 - segmentation fault or crash upon rotating logfile
segmentation fault or crash upon rotating logfile
Status: RESOLVED FIXED
Product: unbound
Classification: Unclassified
Component: server
1.4.17
i386 Linux
: P5 major
Assigned To: unbound team
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-07-04 08:23 CEST by Stephane Lapie
Modified: 2014-07-04 09:50 CEST (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Stephane Lapie 2014-07-04 08:23:24 CEST
We are currently considering using unbound as a recursive cache server for our customers, and are testing Unbound 1.4.17 (default version in Debian Wheezy 7.x).

We are in a situation where we need to keep a lot of query logs (verbosity level 2), and to rotate them hourly. However, using "unbound-control log_reopen" after rotating the logfile seems to be provoking a server crash.

I could confirm it three times over one week :
- 2 times with a mistaken configuration ('use_syslog: yes', default chroot value, and 'logfile: "/var/log/unbound/query.log"')
Since the chroot should be in "/etc/unbound" by default, this should not work, according to the manual which states the logfile is opened after chroot()ing, however it does at least upon launching unbound. "unbound-control log_reopen" then leads to a segmentation fault I could retrace back to free() in libc.

Relevant kernel logs :
[2151548.672569] unbound[9598]: segfault at b7786ffc ip b70eb871 sp b6fa7f0c error 4 in libc-2.13.so[b7076000+15d000]
[2228430.488880] unbound[13448]: segfault at b7755ffc ip b70ba871 sp b6775bac error 4 in libc-2.13.so[b7045000+15d000]

- 1 time after fixing the configuration and removing any possible ambiguities ('use_syslog: no', 'chroot: ""', 'logfile: "/var/log/unbound/query.log"')
It commited one second worth of logs (some 650 lines) to the file before the process seemingly just died.



My rotation script is a cron.hourly shell script running /usr/sbin/logrotate /etc/logrotate.d/unbound

/etc/logrotate.d/unbound is as follows :
/var/log/unbound/query.log
{
	rotate 20
	size 1G
	missingok
	notifempty
	postrotate
		/usr/sbin/unbound-control log_reopen > /dev/null
	endscript
}

Also, upon rotating, the last line of a given current log file is always : [<TIMESTAMP>] unbound[<PID>:<TID>] info: control cmd:  log_reopen

I am suspecting this happens with verbose logging enabled (level 2 so we can make out SERVFAIL queries) AND a lot of traffic (we are currently getting some 2000 queries per second), so it feels like there is a race condition issue with the logfile handle... (I do understand that Unbound's focus is not on logging since this has a huge performance toll)

I am thinking about either :
- shifting everything to syslog in the meantime
- creating a python module to figure out the result of a query
which would avoid the supposed logfile issue altogether.



As for why I am enabling verbose logging : There is an ongoing DNS DDoS resource exhaustion attack, using a given domain, and randomly generated non-existent subdomains, which means traditional DNS server negative caching is powerless.

http://dnsamplificationattacks.blogspot.jp/2014/02/authoritative-name-server-attack.html

I therefore have a need to know which domains frequently provoke SERVFAILs or other errors, but not enough that unbound is not mitigating recursive queries on its own, so we can blacklist them on our own. (as long as one remote domain server is capable of answering even slowly, it seems unbound won't SERVFAIL the queries to the parent domain, which means in a twist of logic, that Unbound might end up help hammering the survivors to death during a DDoS)

So we whipped up some logic based on the error logs where if a given domain has more than 5000 SERVFAILs and unique queries in the last hour, we create a local_zone/local_data record to redirect stuff to localhost. This mitigates outgoing traffic, and we check and clean up these every two hours, effectively implementing a 2-hour blacklist scheme for such <randomsubdomain>.whatever.domain.com queries.

We are of course implementing inbound rate-limiting for customers whom we know are open-resolvers (but going through our cache server), and actively working on getting them to fix the issue, but I figured the logfile problem was worth reporting.

Thanks for your time.
Comment 1 Wouter Wijngaards 2014-07-04 09:50:41 CEST
Hi Stephane,

Thank you for the report.  There was a race condition between the use of the logfile variable and the updating of that variable.  I have added a lock to protect the variable in a multithreaded system.

This should fix the issue.  I have not been able to reproduce the crash.

The fix is committed on the svn trunk version of unbound.  The file util/log.c contains the fixes.

Best regards,
   Wouter