Time-series data can be unpredictable and chaotic. To make sense of a time series, we look for statistical characteristics that describe it; once we can describe the data, we are more confident applying inferential statistics (making predictions). When we say a time series is stationary, we are saying that certain statistical characteristics are constant over time. For our application we use a weakly stationary process rather than a strictly stationary one. The difference is that a strictly stationary process has an unchanging joint distribution throughout, while a weakly stationary process only requires that a handful of characteristics (mean, variance, and autocovariance) be constant. This is the more practical approach for real-world data, as strictly stationary processes rarely appear in financial data.
We define \(Y = \{Y_t\}\) to be a stochastic process. The process is weakly stationary if the following criteria are satisfied.
The time series must have a constant mean \(\mu\), and its autocovariance must depend only on the lag: pick a lag h (think of it as a fixed offset between observations), and the covariance \(\gamma(h) = \text{Cov}(Y_t, Y_{t+h})\) must be the same no matter where t sits in the series. The covariance may be different if h is changed, but for a fixed h it is constant across the whole series.
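As a rough illustration (not a formal test), we can simulate Gaussian white noise, which is weakly stationary by construction, and check that the sample mean and the sample lag-h autocovariance come out roughly the same over two different windows of the series:

```r
set.seed(42)

# simulate a weakly stationary process: Gaussian white noise
n <- 5000
y <- rnorm(n)

# split the series into two halves and compare sample statistics
first  <- y[1:(n / 2)]
second <- y[(n / 2 + 1):n]

# sample means should be close to each other (both near 0)
c(mean(first), mean(second))

# sample autocovariance at a fixed lag h should also match across windows
h <- 5
acov <- function(x, h) {
  m <- mean(x)
  mean((x[1:(length(x) - h)] - m) * (x[(h + 1):length(x)] - m))
}
c(acov(first, h), acov(second, h))
```

For a non-stationary series (for example, a series with a trend), the two windows would disagree noticeably on both statistics.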
Given a time-series, tests can be done to determine if the process is
stationary or not. Suppose we fit a linear model between two assets and
rearrange it as such:
\[ Z_t = Y_{t,1} - \phi_1 Y_{t,2} \]
We then want to test whether Z is a stationary process using the KPSS and PP tests. A unit root test is a hypothesis test for stationarity based on the autoregressive representation of the process: writing \(Z_t = \rho Z_{t-1} + \varepsilon_t\), Z is stationary only if the characteristic polynomial has no roots on the unit circle in the complex plane. If \(|\rho| < 1\) then the process is stationary. If \(\rho = 1\) then the process has a unit root and is therefore non-stationary. We care about the value of \(\rho\) because when \(\rho\) equals one, Z follows a random walk; when it is greater than one, Z explodes; and when it is less than one, Z reverts back to its mean, with the value of \(\rho\) determining the speed of reversion. When we iterate through pairs of assets and fit linear models, these are the two tests we will use in R to test for stationarity.
\[\text{KPSS test: } H_0: \text{the process is stationary, } H_1: \text{the process has a unit root} \]
\[\text{PP test: } H_0: \text{the process has a unit root, } H_1: \text{the process is stationary} \]
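The behavior of the autoregressive coefficient described above can be seen in a small simulation: a coefficient below one produces a series that hugs its mean, while a coefficient of exactly one produces a random walk that wanders freely.

```r
set.seed(7)

# simulate an AR(1) process z_t = phi * z_{t-1} + e_t
simulate_ar1 <- function(phi, n) {
  z <- numeric(n)
  e <- rnorm(n)
  for (t in 2:n) {
    z[t] <- phi * z[t - 1] + e[t]
  }
  z
}

n <- 2000
z_revert <- simulate_ar1(0.5, n)  # coefficient < 1: stationary, mean-reverting
z_rw     <- simulate_ar1(1.0, n)  # coefficient = 1: unit root, random walk

# the mean-reverting series stays near 0; the random walk wanders far away
c(sd(z_revert), sd(z_rw))
```

Plotting both series side by side makes the contrast even clearer: the stationary path oscillates in a fixed band while the random walk drifts without any anchor.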
Looking at a stock's price series, it is easy to see that it is not stationary.
library(quantmod)
df1 <- getSymbols("GOOG", from="2016-01-03", to="2022-10-01", env=NULL, timeout=30)
plot(df1[,6], main="GOOG")
There is a clear upward trend, meaning that a constant \(\mu \text{ or } \gamma(h)\) does not exist, so any forecast that relies on a stationary process would not work. Knowing that stocks typically do not follow a stationary process, we look for a workaround: two assets that are not stationary independently, but whose linear combination is stationary. When such a combination exists, the two assets are said to be cointegrated. We are going to use the two-step method to find cointegrated pairs of assets.
The two-step method first fits a linear regression model of two assets, with one as the response variable and the other as the predictor. We then rearrange the linear model in terms of the residuals, which we call Z. Z is now expressed in terms of the processes \(\{Y_{t,1}\} \text{ and } \{Y_{t,2}\}\) along with an intercept term. The tests we use to determine whether this new process Z is stationary are the Phillips-Ouliaris cointegration test and the KPSS test, both conveniently available in R.
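The mechanics of the regression step can be sketched on simulated data, where we know the true relationship. Here one series is a random walk and the other is twice that walk plus stationary noise, so they are cointegrated by construction; the package-based tests (ca.po from urca, kpss.test from tseries) appear in the full script below, so this sketch only performs step one and forms Z.

```r
set.seed(1)

# simulate a cointegrated pair:
# y2 is a random walk; y1 = 2 * y2 + stationary noise, so
# y1 - 2 * y2 is stationary even though y1 and y2 individually are not
n  <- 3000
y2 <- cumsum(rnorm(n))
y1 <- 2 * y2 + rnorm(n)

# step 1: regress one asset on the other to estimate the hedge ratio
fit  <- lm(y1 ~ y2)
phi1 <- coefficients(fit)[[2]]

# step 2: form the residual process Z
z <- y1 - phi1 * y2

phi1   # recovers a value near the true hedge ratio of 2
sd(z)  # roughly the noise scale, not growing like a random walk
```

In the real script, Z is then handed to ca.po() and kpss.test() rather than being eyeballed through its standard deviation.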
We want to find a cointegrated relationship between two assets because it points to a statistical arbitrage opportunity. If we find a pair of assets that are cointegrated, we can expect mean reversion under the assumption that the process remains stationary. This means a trade can be placed at user-defined levels (long/short), betting that the unique cointegrated relationship reverts back to its mean.
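As a hypothetical sketch of such a rule (the threshold k and the simulated spread are illustrative assumptions, not part of the original strategy), entry signals can be generated from how far Z sits from its mean:

```r
set.seed(3)

# toy spread series standing in for a cointegrated Z (mean-reverting AR(1))
z <- arima.sim(model = list(ar = 0.9), n = 500)

mu    <- mean(z)
sigma <- sd(z)
k     <- 1.5   # user-defined entry threshold, in standard deviations

# short the spread when Z is far above its mean, long when far below,
# betting on reversion back toward mu; otherwise stay flat
signal <- ifelse(z > mu + k * sigma, -1,
          ifelse(z < mu - k * sigma,  1, 0))

table(signal)
```

A real implementation would also need exit rules, position sizing, and a check that the relationship is still stationary out of sample.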
The code below searches the S&P 500 for pairs of assets that are potentially cointegrated. It iterates over all pair combinations, looking for a cointegrated model Z. For a new model Z to be considered cointegrated, we require that it pass both the Phillips-Ouliaris cointegration test and the KPSS test.
library(quantmod)
library(forecast)
library(urca)
library(tseries)
library(BatchGetSymbols)
set.seed(100)
start <- "2016-01-03"
end <- "2022-10-01"
df <- data.frame(ticker1 = character(),
                 ticker2 = character(),
                 kpss = numeric(),
                 one_pct = numeric(),
                 P = numeric(),
                 rank = numeric())
sp500 <- GetSP500Stocks()
for (i in 1:(length(sp500$Tickers) - 1)) {
  ticker1 <- sp500$Tickers[i]
  x1 <- tryCatch(
    {
      df1 <- getSymbols(ticker1, from = start, to = end, env = NULL, timeout = 30)
    },
    error = function(e) {
      print("error with ticker")
    }
  )
  # on a download error, tryCatch returns the short error message instead of a series
  if (length(x1) <= 2) {
    Sys.sleep(1)
    next
  }
  for (j in (i + 1):length(sp500$Tickers)) {
    cat(ticker1, sp500$Tickers[j], "\n")
    ticker2 <- sp500$Tickers[j]
    x2 <- tryCatch(
      {
        df2 <- getSymbols(ticker2, from = start, to = end, env = NULL, timeout = 30)
      },
      error = function(e) {
        print("error with ticker")
      }
    )
    if (length(x2) <= 2) {
      Sys.sleep(1)
      next
    }
    # set up a data frame of adjusted closes for both tickers
    prices <- na.omit(merge(df1[, 6], df2[, 6]))
    names(prices) <- c(ticker1, ticker2)
    # fit a linear regression of the first ticker on the second
    fit <- lm(prices[, 1] ~ prices[, 2])
    # our phi_1 coefficient (the hedge ratio)
    cc <- coefficients(fit)[[2]]
    # define the new process Z, our candidate cointegration relationship
    z <- as.numeric(prices[, 1] - cc * prices[, 2])
    # extract the Phillips-Ouliaris statistic and its 1 pct critical value
    coint.po <- ca.po(prices, demean = "constant")
    wow <- summary(coint.po)
    one_pct <- wow@cval[3]
    P <- wow@teststat
    value <- kpss.test(z)
    kpss <- value$p.value
    # check whether P is greater than the one percent critical value, and
    # whether the KPSS null hypothesis of stationarity is not rejected
    if ((P > one_pct) & (kpss >= 0.1)) {
      cat(P, one_pct, kpss, "\n")
      rank <- P - one_pct
      df[nrow(df) + 1, ] <- c(ticker1, ticker2, kpss, one_pct, P, rank)
    }
  }
}
write.csv(df, "sp500_project.csv", row.names = FALSE)
For this particular run we find the results below.
ticker1 | ticker2 | kpss | one_pct | P | rank |
---|---|---|---|---|---|
ABT | CARR | 0.1 | 48.0021 | 52.27846 | 4.276361 |
HD | LIN | 0.1 | 48.0021 | 75.24260 | 27.240505 |
HD | RMD | 0.1 | 48.0021 | 51.73153 | 3.729431 |
Here are three examples that passed the required tests. Since the data used to fit the models ended in October of 2022, we can plot how the processes played out afterward, through April of 2023.
library(quantmod)
library(forecast)
library(urca)
library(tseries)
library(BatchGetSymbols)
combinations <- data.frame(ticker1 = c("ABT", "HD", "HD"),
                           ticker2 = c("CARR", "LIN", "RMD"))
start <- "2016-01-03"
done_data <- "2022-10-01"
end <- "2023-10-01"
for (i in 1:nrow(combinations)) {
  ticker1 <- combinations[i, 1]
  ticker2 <- combinations[i, 2]
  # refit on the original sample to recover the hedge ratio and mean
  df1 <- getSymbols(ticker1, from = start, to = done_data, env = NULL, timeout = 30)
  df2 <- getSymbols(ticker2, from = start, to = done_data, env = NULL, timeout = 30)
  prices <- na.omit(merge(df1[, 6], df2[, 6]))
  names(prices) <- c(ticker1, ticker2)
  # fit a linear regression of the first ticker on the second
  fit <- lm(prices[, 1] ~ prices[, 2])
  original_cc <- coefficients(fit)[[2]]
  true_mu <- coefficients(fit)[[1]]
  # see how the process behaved going forward
  df1 <- getSymbols(ticker1, from = start, to = end, env = NULL, timeout = 30)
  df2 <- getSymbols(ticker2, from = start, to = end, env = NULL, timeout = 30)
  prices <- na.omit(merge(df1[, 6], df2[, 6]))
  z <- as.numeric(prices[, 1] - original_cc * prices[, 2])
  dates <- as.Date(index(prices))
  plot(dates, z, type = "l", ylab = "Z", xlab = "time", main = paste(ticker1, "/", ticker2))
  # the intercept from our linear regression model
  abline(h = true_mu, col = "red")
  # the end of the fitting window
  abline(v = as.numeric(as.Date("2022-10-01")), col = "red")
}
The horizontal red line represents the mean of Z (the regression intercept), and the vertical red line marks the end of the data used to fit the model (October 2022). Everything to the right of the vertical line shows how the pair behaved afterward under the original model. The date ranges differ because some ticker symbols did not exist for the full time frame of the experiment. Looking at the three plots, all made at least one attempt at reverting back to the mean. The second plot seems to drift away after its first mean reversion, while the first and last plots revert back to the mean at least once. This makes sense, as we would not expect arbitrage opportunities to stay consistent for long.