Probability distributions occur in a variety of elements in GoldSim: distributions are used both to provide model inputs and to represent simulation results. Generally, theoretical (or analytic) distributions are employed for model inputs because these distributions can be defined with just two or three parameters; once the required parameters are provided, the theoretical equation completely describes the distribution. For results, empirical distributions are typically calculated from Monte Carlo simulation results. In this case, the results for the individual realizations provide a sample from the possible spectrum of outcomes. When GoldSim calculates a distribution for simulation results, it employs an algorithm to determine whether the spread of these results is better represented with a continuous distribution or with a discrete distribution.
Example Calculation of a Distribution from a Random Sample
To describe the different distribution categories and to understand the differences among them, it is helpful to work through a simple example in which a distribution is created for the values in a data set. A data set is a finite collection of related values; the values making up the set are individually distinct, or discrete. Figure 1 presents a set of 30 random integers between 0 and 10.
Figure 1: Random list or collection of integers between 0 and 10
The distribution of values in a data set describes how often each value occurs in the set. The most common representation of such a distribution is probably a histogram: a graph showing the frequency with which particular values (or ranges of values) occur in the data set. When a histogram uses bins, or ranges of values such as all values >= 1.0 and < 2.0 (labeled GE_1 in Figure 2), the frequency of occurrence depicts how often values within the bin range occur. Figure 2 displays a binned histogram of the set of random integers presented in Figure 1, along with the counts, or the number of times each value occurs in the set.
Figure 2: Histogram of set shown in Figure 1
Frequency is the number of times that a value, or a bin of values, occurs in the collection of values. Frequency is often converted to an estimate of probability by dividing it by the total number of values, or observations, in the data set (the "sample"). If the frequency of a particular value in the set is n, then the probability of this value occurring is n/N, where N is the total number of observations in the data set. Dividing by N is also called normalization, and the resulting probability can be viewed as the weight for the value or bin of values. A normalized histogram of a collection of values is a Probability Mass Function (PMF): a histogram which plots probability instead of frequency. The PMF created from the histogram in Figure 2 is shown in Figure 3.
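The frequency-to-probability normalization described above can be sketched in a few lines of Python (the function name and sample values are illustrative, not from GoldSim):

```python
from collections import Counter

def pmf_from_sample(values):
    """Normalize frequency counts to probabilities: p = n / N."""
    counts = Counter(values)     # frequency n for each distinct value
    total = len(values)          # total number of observations N
    return {value: n / total for value, n in sorted(counts.items())}

sample = [1, 1, 2, 2, 3]
print(pmf_from_sample(sample))   # {1: 0.4, 2: 0.4, 3: 0.2}
```

Each probability is simply the count of a value divided by the size of the data set, so the probabilities sum to 1.0.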
Figure 3: PMF for the histogram shown in Figure 2
A PMF represents the probabilities for a discrete set of values. Here, the discrete set of values is simply the data set, which is discrete because it contains a finite list of values. If the probabilities in the PMF are accumulated as a running sum, the PMF is converted to a discrete cumulative distribution function (CDF). Because a CDF represents cumulative probability, the cumulative probability values in the CDF range from 0.0 at the low end to 1.0 at the high end. A CDF provides a probability distribution: a mathematical representation of the relative likelihood of obtaining certain specific values from a range of possible values.
The CDF shown in Figure 4 is a discrete CDF because cumulative probability is incremented bin by bin. For example, the cumulative probability for the second bin (2 <= x < 3) is 0.1333: there are two 1-values in the set of 30 numbers (a bin probability of ~6.7%) and two 2-values (another ~6.7%), so the cumulative probability through the second bin is approximately 6.7% + 6.7% = 13.3%. In the discrete CDF representation shown below, the entire second bin has a cumulative probability of 13.3% and the bins appear as "steps". A continuous CDF, in contrast, would provide cumulative probability values that increase continuously with the value, so that a value of 1.5 would have a cumulative probability less than or equal to that of a value of 1.8.
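The PMF-to-CDF conversion by running sum can be sketched as follows (a minimal illustration; the two-bin PMF mirrors the worked numbers above):

```python
from itertools import accumulate

def cdf_from_pmf(pmf):
    """Convert a PMF (value -> probability) to a discrete CDF
    (value -> cumulative probability) via a running sum."""
    values = sorted(pmf)
    cum = list(accumulate(pmf[v] for v in values))
    return dict(zip(values, cum))

# Two 1-values and two 2-values out of 30 give the ~13.3% from the text.
pmf = {1: 2 / 30, 2: 2 / 30}
cdf = cdf_from_pmf(pmf)
print(round(cdf[2], 4))   # 0.1333
```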
Figure 4: CDF obtained from the PMF in Figure 3
The CDF shown in Figure 4 is an empirical probability distribution. It is empirical because it is based on the finite sample of 30 integers, which represent empirical observations. In this case, the CDF is calculated directly from the frequency of occurrence of each value in the sample. The CDF (Figure 4) and PMF (Figure 3) shown above are also discrete because they are based on a finite set of integer values.
The alternative to a discrete distribution is a continuous distribution, which is characterized by a CDF that is a continuous function rather than a step function (Downey, 2011). To get from a discrete distribution to a continuous distribution, smoothing is typically performed to transform the step-type representation into a continuous function. Smoothing could be as simple as assuming a linear increase in cumulative probability between empirical, or discrete, values.
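The simplest smoothing mentioned above, linear interpolation between the discrete points of an empirical CDF, might be sketched like this (the function and the example points are illustrative):

```python
def smoothed_cdf(points, x):
    """Linearly interpolate cumulative probability between the discrete
    (value, cumulative_probability) points of an empirical CDF."""
    points = sorted(points)
    if x <= points[0][0]:
        return points[0][1]
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if x <= x1:
            # Linear rise in cumulative probability across the interval.
            return p0 + (p1 - p0) * (x - x0) / (x1 - x0)
    return points[-1][1]

steps = [(1, 0.25), (2, 0.5), (3, 1.0)]
print(smoothed_cdf(steps, 1.5))   # 0.375, halfway between 0.25 and 0.5
```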
Another form of smoothing is to assume that the values in the data set come from an analytic continuous distribution, also called a theoretical continuous distribution. For an analytic distribution, there is an equation to calculate the cumulative probability for each value on a continuous basis, given certain parameters which are assumed to be fixed, such as the mean and variance. The values in the sample, or empirical data set, can then be used to estimate the parameters of the theoretical distribution, and the equation for the theoretical distribution provides a smoothed, continuous estimate of the CDF observed in the data set.
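As a sketch of this idea, the normal distribution's two parameters can be estimated from a sample (here via the sample mean and standard deviation), after which the analytic CDF gives a continuous cumulative probability for any value; the sample values below are illustrative:

```python
import math

def fit_normal(sample):
    """Estimate the normal distribution's parameters (mean, standard
    deviation) from a sample, using the sample moments."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, math.sqrt(var)

def normal_cdf(x, mean, sd):
    """Analytic CDF of the normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

mean, sd = fit_normal([4, 5, 5, 6])
print(normal_cdf(mean, mean, sd))   # 0.5 at the mean, by symmetry
```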
Theoretical or analytic distributions have an equation which describes the cumulative probability of occurrence of continuous values. There are many types of probability distributions. Common analytic distributions include the normal, uniform, triangular, exponential, gamma, and Erlang distributions. The CDF of the normal distribution which was used to generate the 30 random values presented at the start of this article is shown in Figure 5.
Figure 5: Continuous theoretical distribution used to generate the sample in Figure 1
Probability Distributions in GoldSim
Probability distributions occur in a number of places in GoldSim. The probability distributions related to the three element types listed below, and shown in Figure 6, are discussed to provide examples of how probability distributions are implemented in GoldSim.
- Stochastic element inputs;
- Distribution Result elements; and
- Time History Result elements which display probabilities for Monte Carlo simulations.
The probability distributions available in the Stochastic element are mainly theoretical continuous distributions, which are defined by providing the two or three parameters necessary to completely describe the distribution. A discrete distribution can also be defined with a Stochastic element by entering probability and value pairs. A complete description of the theoretical distributions available in GoldSim is provided in Appendix B of the GoldSim User's Guide; another good resource is the Probability Distributions article in the Knowledge Base. The theoretical distributions are calculated from their defining equations: using these equations, the cumulative probability can be calculated for any value, and the corresponding value can be determined for any cumulative probability.
Figure 6: Primary GoldSim elements which use probability distributions
The result elements, Distribution Result and Time History Result, generally function to create continuous empirical distributions from simulated values. The Distribution Result element will, however, also display theoretical distributions defined via a Stochastic element. Internally, GoldSim uses a results array to generate the empirical distributions for simulation results. For a Time History Result element, a distribution is calculated for each time interval, which provides the ability to generate, for example, a 30th percentile value for day 1 and a 30th percentile value for day 300 of the simulation, even though these percentile values likely correspond to different realizations. An example Monte Carlo simulation time history result is provided in Figure 7.
Figure 7: Example of Time History results for a Monte Carlo simulation where each time interval is represented with a CDF
Distribution Calculation for Simulation Results
GoldSim determines distributions for simulation results using internal result arrays. A result array is simply the collection of simulated values. It is built during a Monte Carlo simulation by adding a pair of values for each realization: the value realized and the weight given by GoldSim to that value. The array is filled "on the fly" during the simulation as each new realization is generated. In principle, each separate realization would represent a separate entry in the results array (consisting of a value and a weight), and if unbiased sampling were carried out, each entry would have equal weight.
As implemented in GoldSim, however, the number of data pairs in the results array may be less than the number of realizations, for two reasons. First, if multiple realizations produce identical values, there is no need to store identical data pairs in the results array: the weight associated with the particular value is simply adjusted (e.g., if the value occurred twice, its weight would be doubled).
Second, for computational reasons, the results array has a maximum number of unique results which it can store; for post-processing GoldSim simulation results, the maximum is 25,000. If the number of realizations exceeds this limit, results are "merged" in a self-consistent manner, as follows.
To merge a new result with the existing results (when the number of realizations exceeds the maximum), GoldSim first finds the surrounding pair of existing results and selects one of them to merge with. The selection is based on the ratio of the distance to each result to the weight of that result (i.e., the program preferentially merges with closer, lower-weight results). After selecting the result to merge with, GoldSim replaces its value with the weighted average of its existing value and the new value, and replaces its weight with the sum of the existing and new weights.
There is one important exception to the merging algorithm discussed above: if the new result is an extremum (i.e., a new highest or lowest value), GoldSim replaces the existing extremum with the new one and then merges the displaced result instead. This means that GoldSim never merges data with an extremum.
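The merging rules above can be sketched in Python. This is an illustrative reading of the described behavior, not GoldSim's actual code; the list structure, function name, and tie-breaking details are assumptions:

```python
import bisect

def merge_result(results, value, weight):
    """Merge a new (value, weight) pair into a full results array:
    a sorted list of [value, weight] lists with at least 3 entries."""
    # Extremum exception: a new minimum or maximum replaces the stored
    # extremum, and the displaced extremum is merged instead.
    if value < results[0][0]:
        (value, weight), results[0] = tuple(results[0]), [value, weight]
    elif value > results[-1][0]:
        (value, weight), results[-1] = tuple(results[-1]), [value, weight]
    # Find the surrounding pair, excluding the extrema themselves
    # (data are never merged with an extremum).
    i = bisect.bisect_left([v for v, _ in results], value)
    lo = results[max(i - 1, 1)]
    hi = results[min(i, len(results) - 2)]
    # Prefer the closer, lower-weight neighbor: smaller distance/weight ratio.
    target = lo if abs(lo[0] - value) / lo[1] <= abs(hi[0] - value) / hi[1] else hi
    # Weighted-average value; summed weight.
    target[0] = (target[0] * target[1] + value * weight) / (target[1] + weight)
    target[1] = target[1] + weight
    return results

full = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
merge_result(full, 1.5, 1.0)
print(full)   # the 1.0 entry becomes [1.25, 2.0]; extrema untouched
```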
Continuous and Discrete CDF Creation from a Result Array
Calculating and plotting the CDF from a results array is straightforward. A discrete CDF is obtained directly by building the results array from sorted values and weights; only smoothing then needs to be applied to create the continuous CDF representation. The basic algorithm assumes that the probability distribution between each adjacent pair of result values is uniform, with a total probability equal to half the sum of the weights of the two values. One implication of this assumption is that, for a continuous distribution, the probability of being less than the smallest value is simply half the weight of the lowest value, and the probability of being greater than the highest value is half the weight of the highest value.
For example, if we have ten equally weighted results in a continuous distribution, there is a uniform probability, equal to 0.1, of being between any two adjacent values. The probability of being below the lowest value or above the highest value would be 0.05. GoldSim extrapolates the value at a probability level of 0 using the slope between the first two observations; similarly, the slope between the last two observations is used to estimate the value at a probability level of 1.
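The half-weight rule described above assigns each sorted result a cumulative probability equal to the total weight below it plus half its own weight. A minimal sketch (function name is illustrative):

```python
def cdf_positions(weights):
    """Cumulative probability at each sorted result value under the
    half-weight rule: half of each value's weight lies below it."""
    positions, cum = [], 0.0
    for w in weights:
        positions.append(cum + w / 2.0)
        cum += w
    return positions

# Ten equally weighted results -> positions 0.05, 0.15, ..., 0.95,
# leaving 0.05 below the lowest value and 0.05 above the highest.
print(cdf_positions([0.1] * 10))
```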
In certain circumstances there are several minor variations to the basic algorithm discussed above. These variations involve representing the CDF on a discrete basis rather than a continuous basis.
- If the number of distinct results is much smaller than the number of realizations, GoldSim assumes the distribution is discrete (rather than continuous) and lumps the probabilities at the actual values sampled. In particular, if the total number of unique results is <= 20 and more than 50% of the realization results were identical to an existing result, GoldSim presumes the distribution is discrete and plots it accordingly. The user can observe this by sampling from a binomial distribution; an example is shown in Figure 8 (the discrete CDF provides a step function).
- GoldSim uses a heuristic algorithm to decide if each specific result represents a discrete value: if the number of exact repetitions of a particular result exceeds 2 × (1 + the average number of repetitions), GoldSim treats the result as a discrete value and does not smooth it. For example, suppose the result is 0.0 50% of the time and normal (mean = 10, s.d. = 2) the rest of the time. The first result value would be 0.0, with a weight of about 0.5. The second value would be close to 8, with a weight of 1/(number of realizations). We would not want to smear half of the 0.0 result over the range from 0 to 8!
Figure 8: Discrete and continuous CDF results from GoldSim
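The first discreteness test above (<= 20 unique results and more than 50% repeats) can be sketched as follows; this is an illustrative reading of the stated thresholds, not GoldSim's internal code:

```python
from collections import Counter

def looks_discrete(results):
    """Heuristic: treat the distribution as discrete when there are
    <= 20 unique results and more than 50% of realizations repeated
    an already-seen result."""
    counts = Counter(results)
    n = len(results)
    repeats = n - len(counts)   # realizations duplicating a prior result
    return len(counts) <= 20 and repeats > 0.5 * n

print(looks_discrete([1, 1, 2, 2, 2, 3] * 10))   # True: 3 unique values
print(looks_discrete(list(range(100))))          # False: all unique
```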