/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* vim: set ts=8 sts=2 et sw=2 tw=80: */
/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

#ifndef mozilla_FastBernoulliTrial_h
#define mozilla_FastBernoulliTrial_h

#include "mozilla/Assertions.h"
#include "mozilla/XorShift128PlusRNG.h"

#include <cmath>
#include <stdint.h>

namespace mozilla {

/**
 * class FastBernoulliTrial: Efficient sampling with uniform probability
 *
 * When gathering statistics about a program's behavior, we may be observing
 * events that occur very frequently (e.g., function calls or memory
 * allocations) and we may be gathering information that is somewhat expensive
 * to produce (e.g., call stacks). Sampling all the events could have a
 * significant impact on the program's performance.
 *
 * Why not just sample every N'th event? This technique is called "systematic
 * sampling"; it's simple and efficient, and it's fine if we imagine a
 * patternless stream of events. But what if we're sampling allocations, and the
 * program happens to have a loop where each iteration does exactly N
 * allocations? You would end up sampling the same allocation every time through
 * the loop; the entire rest of the loop becomes invisible to your measurements!
 * More generally, if each iteration does M allocations, and M and N have any
 * common divisor at all, most allocation sites will never be sampled. If
 * they're both even, say, the odd-numbered allocations disappear from your
 * results.
 *
 * Ideally, we'd like each event to have some probability P of being sampled,
 * independent of its neighbors and of its position in the sequence. This is
 * called "Bernoulli sampling", and it doesn't suffer from any of the problems
 * mentioned above.
 *
 * One disadvantage of Bernoulli sampling is that you can't be sure exactly how
 * many samples you'll get: technically, it's possible that you might sample
 * none of them, or all of them. But if the number of events N is large, these
 * aren't likely outcomes; you can generally expect somewhere around P * N
 * events to be sampled.
 *
 * The other disadvantage of Bernoulli sampling is that you have to generate a
 * random number for every event, which can be slow.
 *
 * [significant pause]
 *
 * BUT NOT WITH THIS CLASS! FastBernoulliTrial lets you do true Bernoulli
 * sampling, while generating a fresh random number only when we do decide to
 * sample an event, not on every trial. When it decides not to sample, a call to
 * |FastBernoulliTrial::trial| is nothing but decrementing a counter and
 * comparing it to zero. So the lower your sampling probability is, the less
 * overhead FastBernoulliTrial imposes.
 *
 * Probabilities of 0 and 1 are handled efficiently. (In neither case need we
 * ever generate a random number at all.)
 *
 * The essential API:
 *
 * - FastBernoulliTrial(double P)
 *   Construct an instance that selects events with probability P.
 *
 * - FastBernoulliTrial::trial()
 *   Return true with probability P. Call this each time an event occurs, to
 *   decide whether to sample it or not.
 *
 * - FastBernoulliTrial::trial(size_t n)
 *   Equivalent to calling trial() |n| times, and returning true if any of
 *   those calls do. However, like trial, this runs in fast constant time.
 *
 * What is this good for? In some applications, some events are "bigger" than
 * others. For example, large allocations are more significant than small
 * allocations. Perhaps we'd like to imagine that we're drawing allocations
 * from a stream of bytes, and performing a separate Bernoulli trial on every
 * byte from the stream. We can accomplish this by calling |t.trial(S)| for
 * the number of bytes S, and sampling the event if that returns true.
 *
 * Of course, this style of sampling needs to be paired with analysis and
 * presentation that makes the size of the event apparent, lest trials with
 * large values for |n| appear to be indistinguishable from those with small
 * values for |n|.
 */
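/*
 * A brief usage sketch (illustrative only: the probability, the seed values,
 * and the recordSample() helper below are hypothetical, not part of this
 * interface):
 *
 *   mozilla::FastBernoulliTrial sampler(0.001, 0x777af95ded598d88ULL,
 *                                       0x17fc07dcb7dd8e29ULL);
 *
 *   void noteAllocation(void* aPtr, size_t aBytes) {
 *     // One Bernoulli trial per allocated byte, so larger allocations are
 *     // proportionally more likely to be sampled.
 *     if (sampler.trial(aBytes))
 *       recordSample(aPtr, aBytes);
 *   }
 */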
class FastBernoulliTrial {
  /*
   * This comment should just read, "Generate skip counts with a geometric
   * distribution", and leave everyone to go look that up and see why it's the
   * right thing to do, if they don't know already.
   *
   * BUT IF YOU'RE CURIOUS, COMMENTS ARE FREE...
   *
   * Instead of generating a fresh random number for every trial, we can
   * randomly generate a count of how many times we should return false before
   * the next time we return true. We call this a "skip count". Once we've
   * returned true, we generate a fresh skip count, and begin counting down
   * again.
   *
   * Here's an awesome fact: by exercising a little care in the way we generate
   * skip counts, we can produce results indistinguishable from those we would
   * get "rolling the dice" afresh for every trial.
   *
   * In short, skip counts in Bernoulli trials of probability P obey a
   * geometric distribution. If a random variable X is uniformly distributed
   * from [0..1), then std::floor(std::log(X) / std::log(1-P)) has the
   * appropriate geometric distribution for the skip counts.
   *
   * Why that formula?
   *
   * Suppose we're to return |true| with some probability P, say, 0.3. Spread
   * all possible futures along a line segment of length 1. In portion P of
   * those cases, we'll return true on the next call to |trial|; the skip count
   * is 0. For the remaining portion 1-P of cases, the skip count is 1 or more.
   *
   *   skip:            0                          1 or more
   *           |------------------^-----------------------------------------|
   *   portion:        0.3                            0.7
   *                    P                             1-P
   *
   * But the "1 or more" section of the line is subdivided the same way:
   * *within that section*, in portion P the second call to |trial()| returns
   * true, and in portion 1-P it returns false a second time; the skip count is
   * two or more. So we return true on the second call in proportion 0.7 * 0.3,
   * and skip at least the first two in proportion 0.7 * 0.7.
   *
   *   skip:            0             1                  2 or more
   *           |------------------^------------^----------------------------|
   *   portion:        0.3           0.7 * 0.3          0.7 * 0.7
   *                    P             (1-P)*P            (1-P)^2
   *
   * We can continue to subdivide:
   *
   * skip >= 0:  |------------------------------------------------- (1-P)^0 --|
   * skip >= 1:  |                  ------------------------------- (1-P)^1 --|
   * skip >= 2:  |                               ------------------ (1-P)^2 --|
   * skip >= 3:  |                                  ^    ---------- (1-P)^3 --|
   * skip >= 4:  |                                  .           --- (1-P)^4 --|
   *                                                .
   *                                                ^X, see below
   *
   * In other words, the likelihood of the next n calls to |trial| returning
   * false is (1-P)^n. The longer a run we require, the more the likelihood
   * drops. Further calls may return false too, but this is the probability
   * we'll skip at least n.
   *
   * This is interesting, because we can pick a point along this line segment
   * and see which skip count's range it falls within; the point X above, for
   * example, is within the ">= 2" range, but not within the ">= 3" range, so
   * it designates a skip count of 2. So if we pick points on the line at
   * random and use the skip counts they fall under, that will be
   * indistinguishable from generating a fresh random number between 0 and 1
   * for each trial and comparing it to P.
   *
   * So to find the skip count for a point X, we must ask: To what whole power
   * must we raise 1-P such that we include X, but the next power would exclude
   * it? This is exactly std::floor(std::log(X) / std::log(1-P)).
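   *
   * (A worked example with illustrative numbers: if P = 0.3 and the random
   * point is X = 0.25, then std::log(0.25) / std::log(0.7) is roughly
   * -1.386 / -0.357 = 3.89, so std::floor gives a skip count of 3: the next
   * three calls to |trial| return false, and the fourth returns true.)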
   *
   * Our algorithm is then, simply: When constructed, compute an initial skip
   * count. Return false from |trial| that many times; then return true and
   * compute a new skip count.
   *
   * For a call to |trial(n)|, if the skip count is greater than n, return
   * false and subtract n from the skip count. If the skip count is n or less,
   * return true and compute a new skip count. Since each trial is independent,
   * it doesn't matter by how much n overshoots the skip count; we can actually
   * compute a new skip count at *any* time without affecting the distribution.
   * This is really beautiful.
   */
 public:
  /**
   * Construct a fast Bernoulli trial generator. Calls to |trial()| return true
   * with probability |aProbability|. Use |aState0| and |aState1| to seed the
   * random number generator; they must not both be zero.
   */
  FastBernoulliTrial(double aProbability, uint64_t aState0, uint64_t aState1)
    : mProbability(0)
    , mInvLogNotProbability(0)
    , mGenerator(aState0, aState1)
    , mSkipCount(0)
  {
    setProbability(aProbability);
  }

  /**
   * Return true with probability |mProbability|. Call this each time an event
   * occurs, to decide whether to sample it or not. The lower |mProbability|
   * is, the faster this function runs.
   */
  bool trial() {
    if (mSkipCount) {
      mSkipCount--;
      return false;
    }

    return chooseSkipCount();
  }

  /**
   * Equivalent to calling trial() |n| times, and returning true if any of
   * those calls do. However, like trial, this runs in fast constant time.
   *
   * What is this good for? In some applications, some events are "bigger" than
   * others. For example, large allocations are more significant than small
   * allocations. Perhaps we'd like to imagine that we're drawing allocations
   * from a stream of bytes, and performing a separate Bernoulli trial on every
   * byte from the stream. We can accomplish this by calling |t.trial(S)| for
   * the number of bytes S, and sampling the event if that returns true.
   *
   * Of course, this style of sampling needs to be paired with analysis and
   * presentation that makes the "size" of the event apparent, lest trials with
   * large values for |n| appear to be indistinguishable from those with small
   * values for |n|, despite being potentially much more likely to be sampled.
   */
  bool trial(size_t aCount) {
    if (mSkipCount > aCount) {
      mSkipCount -= aCount;
      return false;
    }

    return chooseSkipCount();
  }

  void setRandomState(uint64_t aState0, uint64_t aState1) {
    mGenerator.setState(aState0, aState1);
  }

  void setProbability(double aProbability) {
    MOZ_ASSERT(0 <= aProbability && aProbability <= 1);
    mProbability = aProbability;
    if (0 < mProbability && mProbability < 1) {
      /*
       * Let's look carefully at how this calculation plays out in floating-
       * point arithmetic. We'll assume IEEE, but the final C++ code we arrive
       * at would still be fine if our numbers were mathematically perfect. So,
       * while we've considered IEEE's edge cases, we haven't done anything
       * that should be actively bad when using other representations.
       *
       * (In the below, read comparisons as exact mathematical comparisons:
       * when we say something "equals 1", that means it's exactly equal to 1.
       * We treat approximation using intervals with open boundaries: saying a
       * value is in (0,1) doesn't specify how close to 0 or 1 the value gets.
       * When we use closed boundaries like [2**-53, 1], we're careful to
       * ensure the boundary values are actually representable.)
       *
       * - After the comparison above, we know mProbability is in (0,1).
       *
       * - The gaps below 1 are 2**-53, so that interval is (0, 1-2**-53].
       *
       * - Because the floating-point gaps near 1 are wider than those near
       *   zero, there are many small positive doubles ε such that 1-ε rounds
       *   to exactly 1. However, 2**-53 can be represented exactly. So
       *   1-mProbability is in [2**-53, 1].
       *
       * - log(1 - mProbability) is thus in (-37, 0].
       *
       *   That range includes zero, but when we use mInvLogNotProbability, it
       *   would be helpful if we could trust that it's negative. So when
       *   log(1 - mProbability) is 0, we'll just set mProbability to 0, so
       *   that mInvLogNotProbability is not used in chooseSkipCount.
       *
       * - How much of the range of mProbability does this cause us to ignore?
       *   The only value for which log returns 0 is exactly 1; the slope of
       *   log at 1 is 1, so for small ε such that 1 - ε != 1, log(1 - ε) is
       *   -ε, never 0. The gaps near one are larger than the gaps near zero,
       *   so if 1 - ε wasn't 1, then -ε is representable. So if
       *   log(1 - mProbability) isn't 0, then 1 - mProbability isn't 1, which
       *   means that mProbability is at least 2**-53, as discussed earlier.
       *   This is a sampling likelihood of roughly one in nine quadrillion,
       *   which is unlikely to be distinguishable from zero in practice.
       *
       *   So by forbidding zero, we've tightened our range to (-37, -2**-53].
       *
       * - Finally, 1 / log(1 - mProbability) is in [-2**53, -1/37). This all
       *   falls readily within the range of an IEEE double.
       *
       * ALL THAT HAVING BEEN SAID: here are the five lines of actual code:
       */
      double logNotProbability = std::log(1 - mProbability);
      if (logNotProbability == 0.0)
        mProbability = 0.0;
      else
        mInvLogNotProbability = 1 / logNotProbability;
    }

    chooseSkipCount();
  }

 private:
  /* The likelihood that any given call to |trial| should return true. */
  double mProbability;

  /*
   * The value of 1/std::log(1 - mProbability), cached for repeated use.
   *
   * If mProbability is exactly 0 or exactly 1, we don't use this value.
   * Otherwise, we guarantee this value is in the range [-2**53, -1/37), i.e.
   * definitely negative, as required by chooseSkipCount. See setProbability
   * for the details.
   */
  double mInvLogNotProbability;

  /* Our random number generator. */
  non_crypto::XorShift128PlusRNG mGenerator;

  /*
   * The number of times |trial| should return false before next returning
   * true.
   */
  size_t mSkipCount;

  /*
   * Choose the next skip count. This also returns the value that |trial|
   * should return, since we have to check for the extreme values for
   * mProbability anyway, and |trial| should never return true at all when
   * mProbability is 0.
   */
  bool chooseSkipCount() {
    /*
     * If the probability is 1.0, every call to |trial| returns true. Make
     * sure mSkipCount is 0.
     */
    if (mProbability == 1.0) {
      mSkipCount = 0;
      return true;
    }

    /*
     * If the probability is zero, |trial| never returns true. Don't bother us
     * for a while.
     */
    if (mProbability == 0.0) {
      mSkipCount = SIZE_MAX;
      return false;
    }

    /*
     * What sorts of values can this call to std::floor produce?
     *
     * Since mGenerator.nextDouble returns a value in [0, 1-2**-53], std::log
     * returns a value in the range [-infinity, -2**-53], all negative. Since
     * mInvLogNotProbability is negative (see its comments), the product is
     * positive and possibly infinite. std::floor returns +infinity unchanged.
     * So the result will always be positive.
     *
     * Converting a double to an integer that is out of range for that integer
     * is undefined behavior, so we must clamp our result to SIZE_MAX, to
     * ensure we get an acceptable value for mSkipCount.
     *
     * The clamp is written carefully. Note that if we had said:
     *
     *   if (skipCount > SIZE_MAX)
     *     skipCount = SIZE_MAX;
     *
     * that would lead to undefined behavior on 64-bit machines: SIZE_MAX
     * coerced to double is 2^64, not 2^64-1, so this doesn't actually set
     * skipCount to a value that can be safely assigned to mSkipCount.
     *
     * Jakub Oleson cleverly suggested flipping the sense of the comparison:
     * if we require that skipCount < SIZE_MAX, then because of the gaps
     * (2048) between doubles at that magnitude, the highest double less than
     * 2^64 is 2^64 - 2048, which is fine to store in a size_t.
     *
     * (On 32-bit machines, all size_t values can be represented exactly in
     * double, so all is well.)
     */
    double skipCount = std::floor(std::log(mGenerator.nextDouble())
                                  * mInvLogNotProbability);
    if (skipCount < SIZE_MAX)
      mSkipCount = skipCount;
    else
      mSkipCount = SIZE_MAX;

    return true;
  }
};

} /* namespace mozilla */

#endif /* mozilla_FastBernoulliTrial_h */