### rand

Your advice to change the range of rand()'s return value by using the modulus operator is not a good idea. You have correctly pointed out that rand() used this way will be biased towards over-representing some numbers, negating any hope of a uniform distribution. It is much better to use range compression, as in:

`(long)((float)rand()*(100.0/(float)RAND_MAX))` to get a range of [0...100]

than to use `rand() % 100;`

If a uniform distribution is not desired then your suggestion is fine.
You're wrong. Doing range compression does not remove the bias; it simply redistributes it. And it is slower. So using modulus is superior.

EDIT:

Since you probably won't believe me, here's a simplified example to illustrate:

Assumptions:

A) rand() produces a finite number of outputs 'X'
B) we want to translate rand()'s output to a smaller subset 'Y'
C) We assume rand()'s output is evenly distributed, i.e. all numbers in [0..X) have an equal chance of appearing (even though this is implementation dependent and may not always be true). If C is not true, then rand() is fundamentally biased and there is nothing you can do to remove the bias, so for the purposes of this example we assume C is true.

To keep the example simple, let's do some runs with the following numbers:
X = 10
Y = 3

that is, rand() produces [0..10) and we want to generate [0..3)

Illustration:
```cpp
// compressive approach:
a = (int)( (float)rand() * 3 / 10.0f );

// mod approach
b = rand() % 3;
```

Since X is small, we can simply map out how X values will translate to Y values for each approach:

```
rand output | mod result | compress result
====================================================
     0             0              0
     1             1              0
     2             2              0
     3             0              0
     4             1              1
     5             2              1
     6             0              1
     7             1              2
     8             2              2
     9             0              2
```

Notice that in each approach (even the compression method), 0 appears more frequently than 1 or 2. So both approaches are biased towards 0.

EDIT 2:

It comes down to a very simple principle. You are translating X possible outputs into Y possible outputs. So unless X is evenly divisible by Y, there is going to be bias no matter what kind of transformation you do: `X % Y` of the results are going to be over-represented.

plug in whatever numbers for X and Y that you like:

X=54
Y=10
4 results are going to have bias: each item in Y can be generated fairly 5 times, but then there are 4 additional possible outputs of X that cause the bias.

AFAIK the only way to completely avoid this bias is to choose a number 'Z' for which:
X % Z is zero
Z >= Y

Then use Z to scale down from rand, and "reroll" if the result is >= Y

So to carry with my previous example:
X = 10
Y = 3
Z = 5

```cpp
// with bias:
int result = rand() % 3;

// without bias
int result;
do
{
    result = rand() % 5;
} while( result >= 3 );
```

Also... EVEN THAT may not be without bias. Since you are effectively discarding numbers in X's set, you are breaking assumption C above (unless Y never changes: if Y and Z remain constant throughout the life of the RNG, then this truly generates numbers without bias).

EDIT 3:

Lastly, note that this bias exists for all digital RNGs, since they all have a fixed number of possible outputs. Even the really complex ones. Even floating-point ones. If the RNG is digital, it has a finite number of possible outputs -- and translating those to a different range of finite outputs is going to have bias unless the one is evenly divisible by the other.

Mathematical fact.
What you can do that will remove the bias is this:
```cpp
//Returns an integer in [0;max]
int odd_rand(int max){
    //const int rand_max=RAND_MAX-RAND_MAX%max;
    const int rand_max=RAND_MAX-(RAND_MAX+1)%(max+1);
    int r;
    do
        r=rand();
    while (r>rand_max);
    return r%(max+1);
}
```

```
max=2
RAND_MAX=9
rand_max=8

rand output | mod result | compress result | odd_rand()
====================================================================
     0             0              0               0
     1             1              0               1
     2             2              0               2
     3             0              0               0
     4             1              1               1
     5             2              1               2
     6             0              1               0
     7             1              2               1
     8             2              2               2
     9             0              2           undefined
```
ah, a smarter version of the "reroll" technique. Very nice.

Though I still think that since you are discarding output you are adding bias unless 'max' is constant.
Look at it like this: if you flip a coin twice and reflip on TT, HH HT and TH are all three equally likely outcomes. The coin doesn't know that you discarded TT.

What I'm worried about is `RAND_MAX-RAND_MAX%max`. I get the feeling I'm making an off-by-one error, but I'm having a hard time finding where exactly.
> Look at it like this: if you flip a coin twice and reflip on TT, HH HT and TH are all three equally likely outcomes. The coin doesn't know that you discarded TT.

Right, but if 'max' changes, that discarding will throw off the distribution.

Actually, that would throw off the distribution anyway. So maybe it isn't adding bias.

I don't know enough about statistics.
Well, normally you wouldn't want to change max within the same experiment.
For example,
```cpp
int a[4]={0};
for (int b=0;b<1000;b++){
    a[odd_rand(3)]++;
    a[odd_rand(2)]++;
}
```
Regardless of the distribution of odd_rand(), the distribution of a will not be uniform. Obviously, there's nothing the generator can do about this.
What we can do is ensure that the possible outputs are all equiprobable within each individual call.

Also, I was making an off-by-one error:
```cpp
const int rand_max=RAND_MAX-(RAND_MAX+1)%(max+1);
```
I prefer to use random_shuffle vs. rand(); it's like shuffling a deck of cards.
```cpp
#include <algorithm>

// range limit
const int range = 100;

void randomNumberGenerator(int randArray[])
{
    int numbers[range];

    // lays numbers in order
    for (int i = 0; i < range; i++)
        numbers[i] = i;

    // shuffle them like a deck of cards
    std::random_shuffle(&numbers[0], &numbers[range]);

    // copy the shuffled numbers into the caller's array
    for (int i = 0; i < range; i++)
        randArray[i] = numbers[i];
}

int main()
{
    // array to be filled
    int randArray[range];

    randomNumberGenerator(randArray);

    return 0;
}
```
Now the next 100 cards will be unique each time and in range
A shuffle-and-draw has extremely undesirable properties as a general purpose PRNG:
1. There's a relatively low upper bound on the distance between two occurrences of the same number.
2. Certain substrings, such as getting the same number three times in a row, are completely impossible. If you're getting numbers in [0;99] from a uniform generator, you'd expect to get a triplicate once every (very roughly) 10000 iterations.
> If the rng is digital, it has a finite number of possible outputs -- and translating those to a different range of finite outputs is going to have bias unless it's evenly divisible.

Scaling down:

Given a uniform random number generator that generates integers in the inclusive interval [1,97], write an algorithm to get an unbiased uniform random number in the interval [1,23]

```cpp
int uniform_in_1_23( rng_uniform_in_1_97 gen )
{
    int number ;
    do number = gen() - 1 ;
    while( number > 91 ) ;
    return number / 4 + 1 ;
}
```

Scaling up:

Given a uniform random number generator that generates integers in the inclusive interval [1,5], write an algorithm to get an unbiased uniform random number in the interval [1,7]

```
1. generate a pair of random numbers, each in [1,5]
2. if pair is one of (1,1), (1,2), (1,3) => return 1
   if pair is one of (1,4), (1,5), (2,1) => return 2
   if pair is one of (2,2), (2,3), (2,4) => return 3
   if pair is one of (2,5), (3,1), (3,2) => return 4
   if pair is one of (3,3), (3,4), (3,5) => return 5
   if pair is one of (4,1), (4,2), (4,3) => return 6
   if pair is one of (4,4), (4,5), (5,1) => return 7
   if pair is one of (5,2), (5,3), (5,4), (5,5) => go to step 1
```

These require an unbounded number of draws.

In practice, cleverer algorithms are used so that it can be done with a bounded number of draws, with the bias tending to zero. (There is an obvious time vs. bias trade-off: the more draws we are willing to make, the smaller the bias.) For an example, see the boost implementation (in the next post).

Note: Mathematically, it is possible to reduce the sampling bias to an infinitesimal, even where the original RNG has a bias. For instance:
http://web.eecs.umich.edu/~qstout/pap/AnnProb84.pdf
```cpp
//////////////////////// boost /////////////////////////////////
template<class Engine, class T>
T generate_uniform_int(
    Engine& eng, T min_value, T max_value,
    boost::mpl::true_ /** is_integral<Engine::result_type> */)
{
    typedef T result_type;
    typedef typename make_unsigned<T>::type range_type;
    typedef typename Engine::result_type base_result;
    // ranges are always unsigned
    typedef typename make_unsigned<base_result>::type base_unsigned;
    const range_type range =
        random::detail::subtract<result_type>()(max_value, min_value);
    const base_result bmin = (eng.min)();
    const base_unsigned brange =
        random::detail::subtract<base_result>()((eng.max)(), (eng.min)());

    if(range == 0) {
      return min_value;
    } else if(brange == range) {
      // this will probably never happen in real life
      // basically nothing to do; just take care we don't overflow / underflow
      base_unsigned v = random::detail::subtract<base_result>()(eng(), bmin);
      return random::detail::add<base_unsigned, result_type>()(v, min_value);
    } else if(brange < range) {
      // use rejection method to handle things like 0..3 --> 0..4
      for(;;) {
        // concatenate several invocations of the base RNG
        // take extra care to avoid overflows
        //  limit == floor((range+1)/(brange+1))
        //  Therefore limit*(brange+1) <= range+1
        range_type limit;
        if(range == (std::numeric_limits<range_type>::max)()) {
          limit = range/(range_type(brange)+1);
          if(range % (range_type(brange)+1) == range_type(brange))
            ++limit;
        } else {
          limit = (range+1)/(range_type(brange)+1);
        }
        // We consider "result" as expressed to base (brange+1):
        // For every power of (brange+1), we determine a random factor
        range_type result = range_type(0);
        range_type mult = range_type(1);
        // loop invariants:
        //  result < mult
        //  mult <= range
        while(mult <= limit) {
          // Postcondition: result <= range, thus no overflow
          //
          // limit*(brange+1)<=range+1                  def. of limit       (1)
          // eng()-bmin<=brange                         eng() post.         (2)
          // and mult<=limit.                           loop condition      (3)
          // Therefore mult*(eng()-bmin+1)<=range+1     by (1),(2),(3)      (4)
          // Therefore mult*(eng()-bmin)+mult<=range+1  rearranging (4)     (5)
          // result<mult                                loop invariant      (6)
          // Therefore result+mult*(eng()-bmin)<range+1 by (5), (6)         (7)
          result += static_cast<range_type>(
              random::detail::subtract<base_result>()(eng(), bmin)) * mult;
          // equivalent to (mult * (brange+1)) == range+1, but avoids overflow.
          if(mult * range_type(brange) == range - mult + 1) {
            // The destination range is an integer power of
            // the generator's range.
            return(result);
          }
          // Postcondition: mult <= range
          //
          // limit*(brange+1)<=range+1                  def. of limit       (1)
          // mult<=limit                                loop condition      (2)
          // Therefore mult*(brange+1)<=range+1         by (1), (2)         (3)
          // mult*(brange+1)!=range+1                   preceding if        (4)
          // Therefore mult*(brange+1)<range+1          by (3), (4)         (5)
          mult *= range_type(brange)+range_type(1);
        }
        // loop postcondition: range/mult < brange+1
        //
        // mult > limit                                 loop condition      (1)
        // Suppose range/mult >= brange+1               Assumption          (2)
        // range >= mult*(brange+1)                     by (2)              (3)
        // range+1 > mult*(brange+1)                    by (3)              (4)
        // range+1 > (limit+1)*(brange+1)               by (1), (4)         (5)
        // (range+1)/(brange+1) > limit+1               by (5)              (6)
        // limit < floor((range+1)/(brange+1))          by (6)              (7)
        // limit==floor((range+1)/(brange+1))           def. of limit       (8)
        // not (2)                                      reductio            (9)
        //
        // loop postcondition: (range/mult)*mult+(mult-1) >= range
        //
        // (range/mult)*mult + range%mult == range      identity            (1)
        // range%mult < mult                            def. of %           (2)
        // (range/mult)*mult+mult > range               by (1), (2)         (3)
        // (range/mult)*mult+(mult-1) >= range          by (3)              (4)
        //
        // Note that the maximum value of result at this point is (mult-1),
        // so after this final step, we generate numbers that can be
        // at least as large as range. We have to be really careful to avoid
        // overflow in this final addition and in the rejection. Anything
        // that overflows is larger than range and can thus be rejected.

        // range/mult < brange+1 -> no endless loop
        range_type result_increment =
            generate_uniform_int(
                eng,
                static_cast<range_type>(0),
                static_cast<range_type>(range/mult),
                boost::mpl::true_());
        if((std::numeric_limits<range_type>::max)() / mult < result_increment) {
          // The multiplication would overflow. Reject immediately.
          continue;
        }
        result_increment *= mult;
        // unsigned integers are guaranteed to wrap on overflow.
        result += result_increment;
        if(result < result_increment) {
          // The addition overflowed. Reject.
          continue;
        }
        if(result > range) {
          // Too big. Reject.
          continue;
        }
        return random::detail::add<range_type, result_type>()(result, min_value);
      }
    } else {                    // brange > range
      base_unsigned bucket_size;
      // it's safe to add 1 to range, as long as we cast it first,
      // because we know that it is less than brange. However,
      // we do need to be careful not to cause overflow by adding 1
      // to brange.
      if(brange == (std::numeric_limits<base_unsigned>::max)()) {
        bucket_size = brange / (static_cast<base_unsigned>(range)+1);
        if(brange % (static_cast<base_unsigned>(range)+1) ==
               static_cast<base_unsigned>(range)) {
          ++bucket_size;
        }
      } else {
        bucket_size = (brange+1) / (static_cast<base_unsigned>(range)+1);
      }
      for(;;) {
        base_unsigned result =
            random::detail::subtract<base_result>()(eng(), bmin);
        result /= bucket_size;
        // result and range are non-negative, and result is possibly larger
        // than range, so the cast is safe
        if(result <= static_cast<base_unsigned>(range))
          return random::detail::add<base_unsigned, result_type>()(result, min_value);
      }
    }
}
```