Bugzilla – Full Text Bug Listing

| Summary: | Add total sum to histogram metric / Add integrated support for generating Prometheus metrics? | | |
|---|---|---|---|
| Product: | unbound | Reporter: | Ed Schouten <ed> |
| Component: | server | Assignee: | unbound team <unbound-team> |
| Status: | ASSIGNED | | |
| Severity: | enhancement | CC: | benno, cathya, jsha, wouter |
| Priority: | P5 | | |
| Version: | unspecified | | |
| Hardware: | Other | | |
| OS: | All | | |

Description
Ed Schouten
2017-11-21 10:05:59 CET
Hi Ed,

Thank you for your message on the Prometheus monitoring system and the integration with Unbound. We will discuss the options you mentioned at NLnet Labs. That is, either export statistics via Unbound control or in another (integrated) manner, and in a format that can be used by other open source software. An option is to include the integrated code in the Unbound contrib section, but we will need to discuss this. We do think the integration with a system like Prometheus is very useful and some code to facilitate this is valuable. So thanks.

Best regards,
-- Benno

Hi,

What do you mean by the total sum for a bucket, what value is it that you want?

UBCT1 stats_noreset is already documented in the doc/control_proto_spec.txt file. UBCT1 is a version-numbered unique prefix for the commandline input that follows.

It is also possible, and documented, to get statistics via shared memory: stats_shm is the command to unbound-control, and unbound.h exports the definitions that could be used by other programs as well. This has significantly lower CPU requirements than the TLS connection. And it is another way to integrate with unbound's statistics.

Best regards,
Wouter

(In reply to Wouter Wijngaards from comment #2)
> What do you mean by the total sum for a bucket, what value is it that you
> want?

The sum of all of the individual sample values. For example, if there are 10 samples of 1 second and 7 of 3 seconds, then the sum of the sample values would be 10 * 1 + 7 * 3 = 31.

Prometheus uses this because it samples the output regularly. By computing the rate/derivative of both the sum of all the samples and their count, one can obtain the average query latency. The histogram buckets only give insight into quantiles/percentiles, not the average within a timeframe.

> UBCT1 stats_noreset is already documented in the doc/control_proto_spec.txt
> file.
> UBCT1 is a version-numbered unique prefix for the commandline input that
> follows.

Oh, wow. I overlooked that. I only browsed through the man pages. Thanks!

> It is also possible, and documented, to get statistics via shared memory:
> stats_shm is the command to unbound-control, and unbound.h exports the
> definitions that could be used by other programs as well. This has
> significantly lower CPU requirements than the TLS connection. And it is
> another way to integrate with unbound's statistics.

Awesome! Filed a tracking bug for that on the exporter's side:

https://github.com/kumina/unbound_exporter/issues/7

Hi Ed,

Nice to hear!

The average is printed as recursion.time.avg. Not per histogram bucket, but per thread and globally. And that is the average query latency. This sounds like it is the value you are asking for (all wait times summed, divided by number of items)?

Best regards,
Wouter

Hi Wouter,

(In reply to Wouter Wijngaards from comment #4)
> The average is printed as recursion.time.avg. Not per histogram bucket, but
> per thread and globally. And that is the average query latency. This
> sounds like it is the value you are asking for (all wait times summed,
> divided by number of items)?

That does indeed sound useful. I wasn't sure whether total.recursion.time.avg measured the same thing as the histogram. Thanks for confirming. I've just changed unbound_exporter to make use of it:

https://github.com/kumina/unbound_exporter/commit/4f36729f553665a4268b5c265448977276a95096

One thing that we do need to keep an eye out for is whether this approach still guarantees monotonicity. Due to rounding and integer division, it may be the case that avg * count drops in value. That said, let's ignore that for now.

Hi Ed,

Good to hear!

The avg is calculated by keeping track of the sum. It is divided by num.recursivereplies for printing. So avg*count shouldn't drop in value...

Best regards,
Wouter

The entire stack (Unbound + unbound_exporter) now computes this value:

    x = avg * count = timeval_divide(sum, count) * count

As timeval_divide() rounds down its result, an increase of 'count' may yield a smaller total sum.

Hi Ed,

Yes, you are right, it could do that; if that is really a problem for you, it is easy enough to print out the timeval sum. Timeval divide also keeps track of the timeval.usec, so the error is likely some number of microseconds.

Best regards,
Wouter
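
The rounding effect discussed above can be illustrated with a small sketch. This is not Unbound's timeval_divide(); it only assumes, as the thread states, that the average is rounded down, and it models times as whole microseconds. The function name and the sample values are hypothetical.

```python
def reconstructed_sums(samples_us):
    """Return (true_sum, avg * count) after each sample arrives.

    Assumption: the average is computed with a division that rounds
    down at microsecond resolution, as the thread describes for
    timeval_divide(). Times are plain integers in microseconds.
    """
    total = 0
    results = []
    for count, sample in enumerate(samples_us, start=1):
        total += sample
        avg = total // count            # rounded-down average
        results.append((total, avg * count))
    return results


# Hypothetical latencies in microseconds.
for true_sum, rebuilt in reconstructed_sums([4, 3, 3, 1]):
    print(true_sum, rebuilt)
```

With these samples the true sum only ever grows, while the rebuilt value avg * count falls when the fourth sample arrives. Under this model the rebuilt value stays within count microseconds of the true sum, which matches the remark that the error is likely some number of microseconds. Printing the sum directly avoids the issue.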