Skip to the content of the web site.

Project K.1: n choose k

Suppose you wanted to choose $k$ digits from $0$ to $n - 1$.

You could proceed as follows:

  1. Keep track of those $k$ entries you have found in an array.
  2. Iterating $k$ times:
    1. Choose a random entry between $0$ and $n - 1$.
    2. If it is already in our array, return to Step i;
    3. otherwise, put this new entry into our array and continue.

Here is an implementation of this algorithm; however, you may also want to try this yourself, first.

#include <cstdlib>
unsigned int *choose( unsigned int n, unsigned int k );

unsigned int *choose( unsigned int n, unsigned int k ) {
    unsigned int *array = new unsigned int[k];

    for ( unsigned int i{0}; i < k; ++i ) {
        while ( true ) {
            bool found{false};
            unsigned int m{std::rand() % n};

            for ( unsigned int j{0}; j < i; ++j ) {
                if ( array[j] == m ) {
                    found = true;
                    break;
                }
            }

            if ( !found ) {
                array[i] = m;
                break;
            }
        }
    }

    return array;
}

Now, how long will this algorithm run? One significant issue is we can never be sure that it will even finish; after all, there is a non-zero probability that any sequence of $N$ randomly chosen values between $0$ and $n - 1$ are not already in our array of found values.

Now, one way of testing this algorithm would be to try to time how long it takes to run choose( n, n ) for different values of $n$; however, without much thought; however, if $k > \frac{n}{2}$, it would be easy to find choose( n, n - k ). Thus, the program average.cpp calculates choose( n, n/2 ) for $n = 1000, 2000, 3000, \ldots, 100000$.

Now, it may happen that in calling choose( 1000, 500 ) that all $500$ selections are indeed unique, so instead of calling choose( 1000, 500 ) just once and timing it, we will instead call choose( 1000, 500 ) one hundred times, and then print the average run time. This gives the following times:

$n$Average run time of choose( n, n/2 )
(${\rm s}$)
10000.00
20000.00
30000.00
40000.00
50000.00
60000.01
70000.00
80000.00
90000.01
100000.01
110000.01
120000.01
130000.02
140000.01
150000.02
160000.02
170000.03
180000.02
190000.03
200000.03
210000.04
220000.04
230000.04
240000.05
250000.05
260000.06
270000.06
280000.06
290000.07
300000.08
310000.07
320000.09
330000.09
340000.09
350000.10
360000.11
370000.11
380000.12
390000.13
400000.13
410000.13
420000.15
430000.15
440000.16
450000.17
460000.17
470000.18
480000.19
490000.20
500000.20
510000.21
520000.23
530000.23
540000.23
550000.25
560000.26
570000.27
580000.27
590000.29
600000.29
610000.31
620000.31
630000.33
640000.33
650000.35
660000.35
670000.37
680000.38
690000.39
700000.40
710000.42
720000.42
730000.44
740000.44
750000.46
760000.48
770000.48
780000.50
790000.51
800000.53
810000.53
820000.55
830000.57
840000.57
850000.60
860000.60
870000.62
880000.63
890000.65
900000.66
910000.68
920000.69
930000.71
940000.72
950000.74
960000.76
970000.77
980000.78
990000.81
1000000.81

Looking at this data, it may be difficult to see the trend; however, an image makes the trend clear:


Figure 1. The run-time of choose( n, n/2 ) for $n = 1000, 2000, ..., 100000$.

The red line in Figure 1 is the best-fitting least-squares quadratic polynomial $p(n) = 2.8195 \times 10^{-8} n + 8.1437\times 10^{-11} n^2$.

What this trend suggests is that if you were to try to calculate choose( 1000000, 500000 ), it would take on average 81.5 seconds!

The program maximum.cpp executes choose( 1000000, 500000 ) a total of 100 times, and on average each execution is 82.07 seconds, and and the worst-case run time was 84 seconds.

A different approach

Instead, consider the following algorithm:

  1. Create an array of capacity $n$, and initialize the entries from $0$ to $n - 1$.
  2. Now, for $i = 0, \ldots, k - 1$, do the following:
    1. Select a random entry from $i$ to $n - i$ and swap that entry with the entry $i$.

The $k$ entries chosen from $0$ to $n - 1$ now occupy the entries $0$ through $k - 1$ in the given array.

Implement this new function.

Trade-offs

This new function is faster than the one that we implemented above: it always finishes executing in $k$ steps after initializing an array of capacity $n$.

Notice, however, that while this function will run much faster when finding choose( 100000, 50000 ), it may actually be slower when calculating choose( 100000, 5 ). Additionally, while in many cases it may be faster, the size of the array is $n$, while the size of the array for the function implemented above is only $k$.