HomeTechniques and TipsCorrelation in @RISKCorrelation of Discrete Distributions

# 5.6. Correlation of Discrete Distributions

Applies to: @RISK, all versions

I specified a correlation of 0.5 between two RiskBinomial(1, 0.5) distributions, and the actual correlation of simulated results was much lower. I understand that simulated results will only approximately match the requested correlation, but why such a big difference?

This issue will occur to some extent for any discrete distribution. In general, the fewer the possible values of the distribution, the greater discrepancy you will see. Here is the explanation, using two RiskBinomial(1, 0.5) as illustration.

Two of these distributions correlated at 0.5 give (0,0) 37.5% of the time, (0,1) and (1,0) each 12.5% of the time, and (1,1) 37.5% of the time. That is the same as saying that the data pairs (0,0), (0,0), (0,0), (0,1), (1,0), (1,1), (1,1), (1,1) occur with equal frequency. If you put these eight data pairs into Excel, the CORREL( ) function does indeed return 0.5.

So why doesn't @RISK produce those data pairs with about those frequencies? This has to do with how @RISK generates random numbers for correlated distributions.

In order to generate correlated samples for any distributions, @RISK first generates 100 (or however many iterations are desired) pairs of random decimals between 0 and 1 that have the specified correlation coefficient. Call these numbers the U01 numbers. The U01 numbers are then plugged into the distribution's inverse cumulative distribution, which converts the U01 number into a sample over the range of the distribution. In the specific case of a RiskBinomial(1,0.5), any of the U01 numbers below 0.5 map to a 0, and any above 0.5 map to a 1. That all works as designed.

Note, however, that in mapping all those U01 numbers in that way we lose a lot of information. For example, 0.1 maps to a value of 0 the same as 0.49 does, and 0.51 and 0.99 both map to a value of 1. At this point, a lot of the correlation information is lost, because the 0.1 U01 number from the first distribution is more likely paired with a U01 number from the second distribution close to 0.1 than close to 0.49, but that information is lost, since they are both samples of 0.

Another way of looking at it is that at the end, when we have 100 samples from each distribution that are all either 0 or 1, there are many ways to assign ranks to those samples to calculate the Spearman rank-order correlation coefficient. Assigning them all the mid-rank (25.5 & 75.5) is just one way to do it. (For more on this, see How @RISK Computes Rank-Order Correlation.)

If we could assign the 0 sample that came from a lower U01 to a lower rank, and a 0 sample that came from a higher U01 to a higher rank, we'd get an observed correlation coefficient closer to what was asked for. But after the simulation, a 0 is a 0 and a 1 is a 1, and the information about where they came from is not available to @RISK's RiskCorrel( ) function or Excel's CORREL( ) function.