/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* vim: set ts=8 sts=2 et sw=2 tw=80: */
/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

#ifndef mozilla_FastBernoulliTrial_h
#define mozilla_FastBernoulliTrial_h

#include "mozilla/Assertions.h"
#include "mozilla/XorShift128PlusRNG.h"

#include <cmath>
#include <stdint.h>

namespace mozilla {

/**
 * class FastBernoulliTrial: Efficient sampling with uniform probability
 *
 * When gathering statistics about a program's behavior, we may be observing
 * events that occur very frequently (e.g., function calls or memory
 * allocations) and we may be gathering information that is somewhat expensive
 * to produce (e.g., call stacks). Sampling all the events could have a
 * significant impact on the program's performance.
 *
 * Why not just sample every N'th event? This technique is called "systematic
 * sampling"; it's simple and efficient, and it's fine if we imagine a
 * patternless stream of events. But what if we're sampling allocations, and the
 * program happens to have a loop where each iteration does exactly N
 * allocations? You would end up sampling the same allocation every time through
 * the loop; the entire rest of the loop becomes invisible to your measurements!
 * More generally, if each iteration does M allocations, and M and N have any
 * common divisor at all, most allocation sites will never be sampled. If
 * they're both even, say, the odd-numbered allocations disappear from your
 * results.
 *
 * Ideally, we'd like each event to have some probability P of being sampled,
 * independent of its neighbors and of its position in the sequence. This is
 * called "Bernoulli sampling", and it doesn't suffer from any of the problems
 * mentioned above.
 *
 * One disadvantage of Bernoulli sampling is that you can't be sure exactly how
 * many samples you'll get: technically, it's possible that you might sample
 * none of them, or all of them. But if the number of events N is large, these
 * aren't likely outcomes; you can generally expect somewhere around P * N
 * events to be sampled.
 *
 * The other disadvantage of Bernoulli sampling is that you have to generate a
 * random number for every event, which can be slow.
 *
 * [significant pause]
 *
 * BUT NOT WITH THIS CLASS! FastBernoulliTrial lets you do true Bernoulli
 * sampling, while generating a fresh random number only when we do decide to
 * sample an event, not on every trial. When it decides not to sample, a call to
 * |FastBernoulliTrial::trial| is nothing but decrementing a counter and
 * comparing it to zero. So the lower your sampling probability is, the less
 * overhead FastBernoulliTrial imposes.
 *
 * Probabilities of 0 and 1 are handled efficiently. (In neither case need we
 * ever generate a random number at all.)
 *
 * The essential API:
 *
 * - FastBernoulliTrial(double P)
 *   Construct an instance that selects events with probability P.
 *
 * - FastBernoulliTrial::trial()
 *   Return true with probability P. Call this each time an event occurs, to
 *   decide whether to sample it or not.
 *
 * - FastBernoulliTrial::trial(size_t n)
 *   Equivalent to calling trial() |n| times, and returning true if any of
 *   those calls do. However, like trial, this runs in fast constant time.
 *
 * What is this good for? In some applications, some events are "bigger" than
 * others. For example, large allocations are more significant than small
 * allocations. Perhaps we'd like to imagine that we're drawing allocations
 * from a stream of bytes, and performing a separate Bernoulli trial on every
 * byte from the stream. We can accomplish this by calling |t.trial(S)| for
 * the number of bytes S, and sampling the event if that returns true.
 *
 * Of course, this style of sampling needs to be paired with analysis and
 * presentation that makes the size of the event apparent, lest trials with
 * large values for |n| appear to be indistinguishable from those with small
 * values for |n|.
 */
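/*
 * A brief usage sketch (illustrative only: the probability, the seed values,
 * and the recordSample() helper below are hypothetical, not part of this
 * interface):
 *
 *   mozilla::FastBernoulliTrial sampler(0.001, 0x777af95ded598d88ULL,
 *                                       0x17fc07dcb7dd8e29ULL);
 *
 *   void noteAllocation(void* aPtr, size_t aBytes) {
 *     // One Bernoulli trial per allocated byte, so larger allocations are
 *     // proportionally more likely to be sampled.
 *     if (sampler.trial(aBytes))
 *       recordSample(aPtr, aBytes);
 *   }
 */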
class FastBernoulliTrial {
  /*
   * This comment should just read, "Generate skip counts with a geometric
   * distribution", and leave everyone to go look that up and see why it's the
   * right thing to do, if they don't know already.
   *
   * BUT IF YOU'RE CURIOUS, COMMENTS ARE FREE...
   *
   * Instead of generating a fresh random number for every trial, we can
   * randomly generate a count of how many times we should return false before
   * the next time we return true. We call this a "skip count". Once we've
   * returned true, we generate a fresh skip count, and begin counting down
   * again.
   *
   * Here's an awesome fact: by exercising a little care in the way we generate
   * skip counts, we can produce results indistinguishable from those we would
   * get "rolling the dice" afresh for every trial.
   *
   * In short, skip counts in Bernoulli trials of probability P obey a
   * geometric distribution. If a random variable X is uniformly distributed
   * from [0..1), then std::floor(std::log(X) / std::log(1-P)) has the
   * appropriate geometric distribution for the skip counts.
   *
   * Why that formula?
   *
   * Suppose we're to return |true| with some probability P, say, 0.3. Spread
   * all possible futures along a line segment of length 1. In portion P of
   * those cases, we'll return true on the next call to |trial|; the skip count
   * is 0. For the remaining portion 1-P of cases, the skip count is 1 or more.
   *
   *   skip:            0                          1 or more
   *           |------------------^-----------------------------------------|
   *   portion:        0.3                            0.7
   *                    P                             1-P
   *
   * But the "1 or more" section of the line is subdivided the same way:
   * *within that section*, in portion P the second call to |trial()| returns
   * true, and in portion 1-P it returns false a second time; the skip count is
   * two or more. So we return true on the second call in proportion 0.7 * 0.3,
   * and skip at least the first two in proportion 0.7 * 0.7.
   *
   *   skip:            0             1                  2 or more
   *           |------------------^------------^----------------------------|
   *   portion:        0.3           0.7 * 0.3          0.7 * 0.7
   *                    P             (1-P)*P            (1-P)^2
   *
   * We can continue to subdivide:
   *
   * skip >= 0:  |------------------------------------------------- (1-P)^0 --|
   * skip >= 1:  |                  ------------------------------- (1-P)^1 --|
   * skip >= 2:  |                               ------------------ (1-P)^2 --|
   * skip >= 3:  |                                  ^    ---------- (1-P)^3 --|
   * skip >= 4:  |                                  .           --- (1-P)^4 --|
   *                                                .
   *                                                ^X, see below
   *
   * In other words, the likelihood of the next n calls to |trial| returning
   * false is (1-P)^n. The longer a run we require, the more the likelihood
   * drops. Further calls may return false too, but this is the probability
   * we'll skip at least n.
   *
   * This is interesting, because we can pick a point along this line segment
   * and see which skip count's range it falls within; the point X above, for
   * example, is within the ">= 2" range, but not within the ">= 3" range, so
   * it designates a skip count of 2. So if we pick points on the line at
   * random and use the skip counts they fall under, that will be
   * indistinguishable from generating a fresh random number between 0 and 1
   * for each trial and comparing it to P.
   *
   * So to find the skip count for a point X, we must ask: To what whole power
   * must we raise 1-P such that we include X, but the next power would exclude
   * it? This is exactly std::floor(std::log(X) / std::log(1-P)).
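   *
   * (A worked example with illustrative numbers: if P = 0.3 and the random
   * point is X = 0.25, then std::log(0.25) / std::log(0.7) is roughly
   * -1.386 / -0.357 = 3.89, so std::floor gives a skip count of 3: the next
   * three calls to |trial| return false, and the fourth returns true.)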
   *
   * Our algorithm is then, simply: When constructed, compute an initial skip
   * count. Return false from |trial| that many times; then return true and
   * compute a new skip count.
   *
   * For a call to |trial(n)|, if the skip count is greater than n, return
   * false and subtract n from the skip count. If the skip count is n or less,
   * return true and compute a new skip count. Since each trial is independent,
   * it doesn't matter by how much n overshoots the skip count; we can actually
   * compute a new skip count at *any* time without affecting the distribution.
   * This is really beautiful.
   */
 public:
  /**
   * Construct a fast Bernoulli trial generator. Calls to |trial()| return true
   * with probability |aProbability|. Use |aState0| and |aState1| to seed the
   * random number generator; they must not both be zero.
   */
  FastBernoulliTrial(double aProbability, uint64_t aState0, uint64_t aState1)
    : mProbability(0)
    , mInvLogNotProbability(0)
    , mGenerator(aState0, aState1)
    , mSkipCount(0)
  {
    setProbability(aProbability);
  }

  /**
   * Return true with probability |mProbability|. Call this each time an event
   * occurs, to decide whether to sample it or not. The lower |mProbability|
   * is, the faster this function runs.
   */
  bool trial() {
    if (mSkipCount) {
      mSkipCount--;
      return false;
    }

    return chooseSkipCount();
  }

  /**
   * Equivalent to calling trial() |n| times, and returning true if any of
   * those calls do. However, like trial, this runs in fast constant time.
   *
   * What is this good for? In some applications, some events are "bigger" than
   * others. For example, large allocations are more significant than small
   * allocations. Perhaps we'd like to imagine that we're drawing allocations
   * from a stream of bytes, and performing a separate Bernoulli trial on every
   * byte from the stream. We can accomplish this by calling |t.trial(S)| for
   * the number of bytes S, and sampling the event if that returns true.
   *
   * Of course, this style of sampling needs to be paired with analysis and
   * presentation that makes the "size" of the event apparent, lest trials with
   * large values for |n| appear to be indistinguishable from those with small
   * values for |n|, despite being potentially much more likely to be sampled.
   */
  bool trial(size_t aCount) {
    if (mSkipCount > aCount) {
      mSkipCount -= aCount;
      return false;
    }

    return chooseSkipCount();
  }

  void setRandomState(uint64_t aState0, uint64_t aState1) {
    mGenerator.setState(aState0, aState1);
  }

  void setProbability(double aProbability) {
    MOZ_ASSERT(0 <= aProbability && aProbability <= 1);
    mProbability = aProbability;
    if (0 < mProbability && mProbability < 1) {
      /*
       * Let's look carefully at how this calculation plays out in floating-
       * point arithmetic. We'll assume IEEE, but the final C++ code we arrive
       * at would still be fine if our numbers were mathematically perfect. So,
       * while we've considered IEEE's edge cases, we haven't done anything
       * that should be actively bad when using other representations.
       *
       * (In the below, read comparisons as exact mathematical comparisons:
       * when we say something "equals 1", that means it's exactly equal to 1.
       * We treat approximation using intervals with open boundaries: saying a
       * value is in (0,1) doesn't specify how close to 0 or 1 the value gets.
       * When we use closed boundaries like [2**-53, 1], we're careful to
       * ensure the boundary values are actually representable.)
       *
       * - After the comparison above, we know mProbability is in (0,1).
       *
       * - The gaps below 1 are 2**-53, so that interval is (0, 1-2**-53].
       *
       * - Because the floating-point gaps near 1 are wider than those near
       *   zero, there are many small positive doubles ε such that 1-ε rounds
       *   to exactly 1. However, 2**-53 can be represented exactly. So
       *   1-mProbability is in [2**-53, 1].
       *
       * - log(1 - mProbability) is thus in (-37, 0].
       *
       *   That range includes zero, but when we use mInvLogNotProbability, it
       *   would be helpful if we could trust that it's negative. So when
       *   log(1 - mProbability) is 0, we'll just set mProbability to 0, so
       *   that mInvLogNotProbability is not used in chooseSkipCount.
       *
       * - How much of the range of mProbability does this cause us to ignore?
       *   The only value for which log returns 0 is exactly 1; the slope of
       *   log at 1 is 1, so for small ε such that 1 - ε != 1, log(1 - ε) is
       *   -ε, never 0. The gaps near one are larger than the gaps near zero,
       *   so if 1 - ε wasn't 1, then -ε is representable. So if
       *   log(1 - mProbability) isn't 0, then 1 - mProbability isn't 1, which
       *   means that mProbability is at least 2**-53, as discussed earlier.
       *   This is a sampling likelihood of roughly one in nine quadrillion,
       *   which is unlikely to be distinguishable from zero in practice.
       *
       *   So by forbidding zero, we've tightened our range to (-37, -2**-53].
       *
       * - Finally, 1 / log(1 - mProbability) is in [-2**53, -1/37). This all
       *   falls readily within the range of an IEEE double.
       *
       * ALL THAT HAVING BEEN SAID: here are the five lines of actual code:
       */
      double logNotProbability = std::log(1 - mProbability);
      if (logNotProbability == 0.0)
        mProbability = 0.0;
      else
        mInvLogNotProbability = 1 / logNotProbability;
    }

    chooseSkipCount();
  }

 private:
  /* The likelihood that any given call to |trial| should return true. */
  double mProbability;

  /*
   * The value of 1/std::log(1 - mProbability), cached for repeated use.
   *
   * If mProbability is exactly 0 or exactly 1, we don't use this value.
   * Otherwise, we guarantee this value is in the range [-2**53, -1/37), i.e.
   * definitely negative, as required by chooseSkipCount. See setProbability
   * for the details.
   */
  double mInvLogNotProbability;

  /* Our random number generator. */
  non_crypto::XorShift128PlusRNG mGenerator;

  /*
   * The number of times |trial| should return false before next returning
   * true.
   */
  size_t mSkipCount;

  /*
   * Choose the next skip count. This also returns the value that |trial|
   * should return, since we have to check for the extreme values for
   * mProbability anyway, and |trial| should never return true at all when
   * mProbability is 0.
   */
  bool chooseSkipCount() {
    /*
     * If the probability is 1.0, every call to |trial| returns true. Make
     * sure mSkipCount is 0.
     */
    if (mProbability == 1.0) {
      mSkipCount = 0;
      return true;
    }

    /*
     * If the probability is zero, |trial| never returns true. Don't bother us
     * for a while.
     */
    if (mProbability == 0.0) {
      mSkipCount = SIZE_MAX;
      return false;
    }

    /*
     * What sorts of values can this call to std::floor produce?
     *
     * Since mGenerator.nextDouble returns a value in [0, 1-2**-53], std::log
     * returns a value in the range [-infinity, -2**-53], all negative. Since
     * mInvLogNotProbability is negative (see its comments), the product is
     * positive and possibly infinite. std::floor returns +infinity unchanged.
     * So the result will always be positive.
     *
     * Converting a double to an integer that is out of range for that integer
     * is undefined behavior, so we must clamp our result to SIZE_MAX, to
     * ensure we get an acceptable value for mSkipCount.
     *
     * The clamp is written carefully. Note that if we had said:
     *
     *   if (skipCount > SIZE_MAX)
     *     skipCount = SIZE_MAX;
     *
     * that would lead to undefined behavior on 64-bit machines: SIZE_MAX
     * coerced to double is 2^64, not 2^64-1, so this doesn't actually set
     * skipCount to a value that can be safely assigned to mSkipCount.
     *
     * Jakub Oleson cleverly suggested flipping the sense of the comparison:
     * if we require that skipCount < SIZE_MAX, then because of the gaps
     * (2048) between doubles at that magnitude, the highest double less than
     * 2^64 is 2^64 - 2048, which is fine to store in a size_t.
     *
     * (On 32-bit machines, all size_t values can be represented exactly in
     * double, so all is well.)
     */
    double skipCount = std::floor(std::log(mGenerator.nextDouble())
                                  * mInvLogNotProbability);
    if (skipCount < SIZE_MAX)
      mSkipCount = skipCount;
    else
      mSkipCount = SIZE_MAX;

    return true;
  }
};

} /* namespace mozilla */

#endif /* mozilla_FastBernoulliTrial_h */