Statistical Distributions in C++

Statistical Distributions in C++

One of the most common concepts in quantitative finance is that of a statistical distribution. Random variables play a huge part in quantitative financial modelling. Derivatives pricing, cash-flow forecasting and quantitative trading all make use of statistical methods in some fashion. Hence, modelling statistical distributions is extremely important in C++.

Many of the articles written on this site have made use of random number generators in order to carry out pricing tasks. So far I have been doing this in a procedural manner. Functions have been called to provide random numbers without any data encapsulation of those random number generators. The goal of this article is to show you that it is beneficial to create a class hierarchy both for statistical distributions and random number generators, separating them out in order to gain the most leverage from code reuse.

In a nutshell, we are splitting the generation of (uniform integer) random numbers from draws of specific statistical distributions, such that we can use the statistics classes elsewhere without bringing along the "heavy" random number generation functions. Equally useful is the fact that we will be able "swap out" different random number generators for our statistics classes for reasons of reliability, extensibility and efficiency.

Design of Statistical Distribution Inheritance Hierarchy

The inheritance hierarchy for modelling of statistical distributions is relatively simple. Each distribution of interest will share the same interface, so we will create an abstract base class. We are primarily interested in modelling continuous probability distributions for the time being, although in later articles we will extend the class hierarchy to handle discrete distributions.

Each (continuous) statistical distribution contains the following properties to be modelled:

• Domain Interval - The interval subset of $\mathbb{R}$ with which the distribution is defined for
• Probability Density Function - Describes the frequency for any particular value in our domain
• Cumulative Density Function - The function describing the probability that a value is less than or equal to a particular value
• Expectation - The expected value (the mean) of the distribution
• Variance - Characterisation of the spread of values around the expected value
• Standard Deviation - The square root of the variance, used because it possesses the same units as the expected value, unlike the variance

We also wish to produce a sequence of random draws from this distribution, assuming a sequence of random numbers is available to provide the "randomness". We can achieve this in two ways. We can either use the inverse cumulative distribution function (also known as the quantile function), which is a property of the distribution itself, or we can use a custom method (such as Box-Muller). Some of the distributions do not possess an analytical inverse to the CDF and hence they will need to be approximated numerically, via an appropriate algorithm. This calculation will be encapsulated into the class of the relevant inherited distribution. In later articles we will "separate out" such algorithms, allowing for re-use in other code.

Here is the partial header file for the StatisticalDistribution abstract base class (we will add extra distributions later):

#ifndef __STATISTICS_H
#define __STATISTICS_H

#include <cmath>
#include <vector>

class StatisticalDistribution {
public:
StatisticalDistribution();
virtual ~StatisticalDistribution();

// Distribution functions
virtual double pdf(const double& x) const = 0;
virtual double cdf(const double& x) const = 0;

// Inverse cumulative distribution functions (aka the quantile function)
virtual double inv_cdf(const double& quantile) const = 0;

// Descriptive stats
virtual double mean() const = 0;
virtual double var() const = 0;
virtual double stdev() const = 0;

// Obtain a sequence of random draws from this distribution
virtual void random_draws(const std::vector<double>& uniform_draws,
std::vector<double>& dist_draws) = 0;
};

#endif

We've specified pure virtual methods for the probability density function (pdf), cumulative density function (cdf), inverse cdf (inv_cdf), as well as descriptive statistics functions such as mean, var (variance) and stdev (standard deviation). Finally we have a method that takes in a vector of uniform random variables on the open interval $(0,1)$, then fills a vector of identical length with draws from the distribution.

Since all of the methods are pure virtual, we only need a very simple implementation of a source file for this class, since we are simply specifying an interface. However, we would like to see a concrete implementation of a particular class. We will consider, arguably, the most important distribution in quantitative finance, namely the standard normal distribution.

Standard Normal Distribution Implementation

Firstly we'll briefly review the formulae for the various methods we need to implement for the standard normal distribution. The probability density function of the standard normal distribution is given by:

\begin{eqnarray*} \phi(x) = \frac{1}{\sqrt{2 \pi}}e^{-\frac{1}{2}x^2} \end{eqnarray*}

The cumulative density function is given by:

\begin{eqnarray*} \Phi(x) = \frac{1}{\sqrt{2 \pi}} \int^{x}_{-\infty} e^{-\frac{t^2}{2}} dt \end{eqnarray*}

The inverse cumulative density function of the standard normal distribution (also known as the probit function) is somewhat more involved. No analytical formula exists for this particular function and so it must be approximated by numerical methods. We will utilise the Beasley-Springer-Moro algorithm, found in Korn. I won't describe it in detail here, for details refer to Korn.

Given that we are dealing with the standard normal distribution, the mean is simply $\mu = 0$, variance $\sigma^2 = 1$ and standard deviation, $\sigma = 1$. The implementation for the header file (which is a continuation of statistics.h above) is as follows:

class StandardNormalDistribution : public StatisticalDistribution {
public:
StandardNormalDistribution();
virtual ~StandardNormalDistribution();

// Distribution functions
virtual double pdf(const double& x) const;
virtual double cdf(const double& x) const;

// Inverse cumulative distribution function (aka the probit function)
virtual double inv_cdf(const double& quantile) const;

// Descriptive stats
virtual double mean() const;   // equal to 0
virtual double var() const;    // equal to 1
virtual double stdev() const;  // equal to 1

// Obtain a sequence of random draws from the standard normal distribution
virtual void random_draws(const std::vector<double>& uniform_draws,
std::vector<double>& dist_draws);
};

The source file is given below:

#ifndef __STATISTICS_CPP
#define __STATISTICS_CPP

#define _USE_MATH_DEFINES

#include "statistics.h"
#include <iostream>

StatisticalDistribution::StatisticalDistribution() {}
StatisticalDistribution::~StatisticalDistribution() {}

// Constructor/destructor
StandardNormalDistribution::StandardNormalDistribution() {}
StandardNormalDistribution::~StandardNormalDistribution() {}

// Probability density function
double StandardNormalDistribution::pdf(const double& x) const {
return (1.0/sqrt(2.0 * M_PI)) * exp(-0.5*x*x);
}

// Cumulative density function
double StandardNormalDistribution::cdf(const double& x) const {
double k = 1.0/(1.0 + 0.2316419*x);
double k_sum = k*(0.319381530 + k*(-0.356563782 + k*(1.781477937 + k*(-1.821255978 + 1.330274429*k))));

if (x >= 0.0) {
return (1.0 - (1.0/(pow(2*M_PI,0.5)))*exp(-0.5*x*x) * k_sum);
} else {
return 1.0 - cdf(-x);
}
}

// Inverse cumulative distribution function (aka the probit function)
double StandardNormalDistribution::inv_cdf(const double& quantile) const {
// This is the Beasley-Springer-Moro algorithm which can
// be found in Glasserman . We won't go into the
// details here, so have a look at the reference for more info
static double a = {   2.50662823884,
-18.61500062529,
41.39119773534,
-25.44106049637};

static double b = {  -8.47351093090,
23.08336743743,
-21.06224101826,
3.13082909833};

static double c = {0.3374754822726147,
0.9761690190917186,
0.1607979714918209,
0.0276438810333863,
0.0038405729373609,
0.0003951896511919,
0.0000321767881768,
0.0000002888167364,
0.0000003960315187};

if (quantile >= 0.5 && quantile <= 0.92) {
double num = 0.0;
double denom = 1.0;

for (int i=0; i<4; i++) {
num += a[i] * pow((quantile - 0.5), 2*i + 1);
denom += b[i] * pow((quantile - 0.5), 2*i);
}
return num/denom;

} else if (quantile > 0.92 && quantile < 1) {
double num = 0.0;

for (int i=0; i<9; i++) {
num += c[i] * pow((log(-log(1-quantile))), i);
}
return num;

} else {
return -1.0*inv_cdf(1-quantile);
}
}

// Expectation/mean
double StandardNormalDistribution::mean() const { return 0.0; }

// Variance
double StandardNormalDistribution::var() const { return 1.0; }

// Standard Deviation
double StandardNormalDistribution::stdev() const { return 1.0; }

// Obtain a sequence of random draws from this distribution
void StandardNormalDistribution::random_draws(const std::vector& uniform_draws,
std::vector& dist_draws) {
// The simplest method to calculate this is with the Box-Muller method,
// which has been used procedurally in many other articles on QuantStart.com

// Check that the uniform draws and dist_draws are the same size and
// have an even number of elements (necessary for B-M)
if (uniform_draws.size() != dist_draws.size()) {
std::cout << "Draws vectors are of unequal size in standard normal dist." << std::endl;
return;
}

// Check that uniform draws have an even number of elements (necessary for B-M)
if (uniform_draws.size() % 2 != 0) {
std::cout << "Uniform draw vector size not an even number." << std::endl;
return;
}

// Slow, but easy to implement
for (int i=0; i<uniform_draws.size() / 2; i++) {
dist_draws[2*i] = sqrt(-2.0*log(uniform_draws[2*i])) * sin(2*M_PI*uniform_draws[2*i+1]);
dist_draws[2*i+1] = sqrt(-2.0*log(uniform_draws[2*i])) * cos(2*M_PI*uniform_draws[2*i+1]);
}

return;
}
#endif

I'll discuss briefly some of the implementations here. The cumulative distribution function (cdf) is referenced from Joshi. It is an approximation, rather than closed-form solution. The inverse CDF (inv_cdf) makes use of the Beasley-Springer-Moro algorithm, which I coded up directly from the implementation in Korn. A similar method can be found in Joshi. Once again the algorithm is an approximation to the real function, rather than a closed form solution. The final method is random_draws. In this instance we are using the Box-Muller algorithm. However, we could instead utilise the more efficient Ziggurat algorithm, but that will be a potential discussion for a later article!

The Main Listing

We will now utilise the new statistical distribution classes with a simple random number generator in order to output statistical values:

#include "statistics.h"
#include <iostream>
#include <vector>

int main(int argc, char **argv) {

// Create the Standard Normal Distribution and random draw vectors
StandardNormalDistribution snd;
std::vector<double> uniform_draws(20, 0.0);
std::vector<double> normal_draws(20, 0.0);

// Simple random number generation method based on RAND
for (int i=0; i<uniform_draws.size(); i++) {
uniform_draws[i] = rand() / static_cast<double>(RAND_MAX);
}

// Create standard normal random draws
// Notice that the uniform draws are unaffected. We have separated
// out the uniform creation from the normal draw creation, which
// will allow us to create sophisticated random number generators
// without interfering with the statistical classes
snd.random_draws(uniform_draws, normal_draws);

// Output the values of the standard normal random draws
for (int i=0; i<normal_draws.size(); i++) {
std::cout << normal_draws[i] << std::endl;
}

return 0;
}

The output from the program is as follows (a sequence of normally distributed random variables):

3.56692
3.28529
0.192324
-0.723522
1.10093
0.217484
-2.22963
-1.06868
-0.35082
0.806425
-0.168485
-1.3742
0.131154
0.59425
-0.449029
-2.37823
0.0431789
0.891999
0.564585
1.26432

Now that we have set up the inheritance hierarchy, in later articles we will construct additional (continuous) statistical distributions, such as the log-normal distribution, the gamma distribution and the chi-square distribution.