Time-series data can be unpredictable and chaotic. To make sense of a time series, we look for statistical characteristics that describe it; once we can describe the data, we are more confident applying inferential statistics (making predictions). When we say a time series is stationary, we are saying that certain statistical characteristics are constant over time. For our application we use a weakly stationary process rather than a strictly stationary one. The difference is that a strictly stationary process has an unchanging joint distribution throughout, while a weakly stationary process only requires that a handful of characteristics (mean, variance, and autocovariance) be constant. This is the more practical approach for real-world data, as strictly stationary processes rarely appear in financial data.
We define \(Y = \{Y_t\}\) to be a stochastic process. The process is weakly stationary if the following criteria are satisfied.
The time series must have a constant mean \(\mu\), and its autocovariance must depend only on the lag: pick a lag h (think of it as a fixed offset between observations), and the covariance \(\gamma(h) = \text{Cov}(Y_t, Y_{t+h})\) must be the same no matter where t sits in the series. The covariance may be different if h is changed, but for a fixed h it is constant across the whole series.
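As a rough illustration (not a formal test), we can simulate Gaussian white noise, which is weakly stationary by construction, and check that the sample mean and the sample lag-h autocovariance come out roughly the same over two different windows of the series:

```r
set.seed(42)

# simulate a weakly stationary process: Gaussian white noise
n <- 5000
y <- rnorm(n)

# split the series into two halves and compare sample statistics
first  <- y[1:(n / 2)]
second <- y[(n / 2 + 1):n]

# sample means should be close to each other (both near 0)
c(mean(first), mean(second))

# sample autocovariance at a fixed lag h should also match across windows
h <- 5
acov <- function(x, h) {
  m <- mean(x)
  mean((x[1:(length(x) - h)] - m) * (x[(h + 1):length(x)] - m))
}
c(acov(first, h), acov(second, h))
```

For a non-stationary series (for example, a series with a trend), the two windows would disagree noticeably on both statistics.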
Given a time-series, tests can be done to determine if the process is
stationary or not. Suppose we fit a linear model between two assets and
rearrange it as such:
\[ Z_t = Y_{t,1} - \phi_1 Y_{t,2} \]
We then want to test whether Z is a stationary process using the KPSS and PP tests. A unit root test is a hypothesis test for stationarity based on the autoregressive representation of the process: writing \(Z_t = \rho Z_{t-1} + \varepsilon_t\), Z is stationary only if the characteristic polynomial has no roots on the unit circle in the complex plane. If \(|\rho| < 1\) then the process is stationary. If \(\rho = 1\) then the process has a unit root and is therefore non-stationary. We care about the value of \(\rho\) because when \(\rho\) equals one, Z follows a random walk; when it is greater than one, Z explodes; and when it is less than one, Z reverts back to its mean, with the value of \(\rho\) determining the speed of reversion. When we iterate through pairs of assets and fit linear models, these are the two tests we will use in R to test for stationarity.
\[\text{KPSS test: } H_0: \text{the process is stationary, } H_1: \text{the process has a unit root} \]
\[\text{PP test: } H_0: \text{the process has a unit root, } H_1: \text{the process is stationary} \]
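The behavior of the autoregressive coefficient described above can be seen in a small simulation: a coefficient below one produces a series that hugs its mean, while a coefficient of exactly one produces a random walk that wanders freely.

```r
set.seed(7)

# simulate an AR(1) process z_t = phi * z_{t-1} + e_t
simulate_ar1 <- function(phi, n) {
  z <- numeric(n)
  e <- rnorm(n)
  for (t in 2:n) {
    z[t] <- phi * z[t - 1] + e[t]
  }
  z
}

n <- 2000
z_revert <- simulate_ar1(0.5, n)  # coefficient < 1: stationary, mean-reverting
z_rw     <- simulate_ar1(1.0, n)  # coefficient = 1: unit root, random walk

# the mean-reverting series stays near 0; the random walk wanders far away
c(sd(z_revert), sd(z_rw))
```

Plotting both series side by side makes the contrast even clearer: the stationary path oscillates in a fixed band while the random walk drifts without any anchor.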
Looking at a stock's price series, it is easy to see that it is not stationary.
library(quantmod)
df1 <- getSymbols("GOOG", from="2016-01-03", to="2022-10-01", env=NULL, timeout=30)
plot(df1[,6], main="GOOG")
There is a clear upward trend, meaning that a constant \(\mu \text{ or } \gamma(h)\) does not exist, so any forecast that relies on a stationary process would not work. Knowing that stocks typically do not follow a stationary process, we look for a workaround: two assets that are not stationary independently, but whose linear combination is stationary. When such a combination exists, the two assets are said to be cointegrated. We are going to use the two-step method to find cointegrated pairs of assets.
The two-step method first fits a linear regression model of two assets, with one as the response variable and the other as the predictor. We then rearrange the linear model in terms of the residuals, which we call Z. Z is now expressed in terms of the processes \(\{Y_{t,1}\} \text{ and } \{Y_{t,2}\}\) along with an intercept term. The tests we use to determine whether this new process Z is stationary are the Phillips-Ouliaris cointegration test and the KPSS test, both conveniently available in R.
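The mechanics of the regression step can be sketched on simulated data, where we know the true relationship. Here one series is a random walk and the other is twice that walk plus stationary noise, so they are cointegrated by construction; the package-based tests (ca.po from urca, kpss.test from tseries) appear in the full script below, so this sketch only performs step one and forms Z.

```r
set.seed(1)

# simulate a cointegrated pair:
# y2 is a random walk; y1 = 2 * y2 + stationary noise, so
# y1 - 2 * y2 is stationary even though y1 and y2 individually are not
n  <- 3000
y2 <- cumsum(rnorm(n))
y1 <- 2 * y2 + rnorm(n)

# step 1: regress one asset on the other to estimate the hedge ratio
fit  <- lm(y1 ~ y2)
phi1 <- coefficients(fit)[[2]]

# step 2: form the residual process Z
z <- y1 - phi1 * y2

phi1   # recovers a value near the true hedge ratio of 2
sd(z)  # roughly the noise scale, not growing like a random walk
```

In the real script, Z is then handed to ca.po() and kpss.test() rather than being eyeballed through its standard deviation.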
We want to find a cointegrated relationship between two assets because it points to a statistical arbitrage opportunity. If we find a pair of assets that are cointegrated, we can expect mean reversion under the assumption that the process remains stationary. This means a trade can be placed at user-defined levels (long/short), betting that the unique cointegrated relationship reverts back to its mean.
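As a hypothetical sketch of such a rule (the threshold k and the simulated spread are illustrative assumptions, not part of the original strategy), entry signals can be generated from how far Z sits from its mean:

```r
set.seed(3)

# toy spread series standing in for a cointegrated Z (mean-reverting AR(1))
z <- arima.sim(model = list(ar = 0.9), n = 500)

mu    <- mean(z)
sigma <- sd(z)
k     <- 1.5   # user-defined entry threshold, in standard deviations

# short the spread when Z is far above its mean, long when far below,
# betting on reversion back toward mu; otherwise stay flat
signal <- ifelse(z > mu + k * sigma, -1,
          ifelse(z < mu - k * sigma,  1, 0))

table(signal)
```

A real implementation would also need exit rules, position sizing, and a check that the relationship is still stationary out of sample.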
The code below searches the S&P 500 for pairs of assets that are potentially cointegrated. It iterates over all pair combinations, looking for a cointegrated model Z. For a new model Z to be considered cointegrated, we require that it pass both the Phillips-Ouliaris cointegration test and the KPSS test.
library(quantmod)
library(forecast)
library(urca)
library(tseries)
library(BatchGetSymbols)
set.seed(100)
start <- "2016-01-03"
end <- "2022-10-01"
df <- data.frame(ticker1 = character(),
                 ticker2 = character(),
                 kpss = numeric(),
                 one_pct = numeric(),
                 P = numeric(),
                 rank = numeric())
sp500 <- GetSP500Stocks()
for (i in 1:(length(sp500$Tickers) - 1)) {
  ticker1 <- sp500$Tickers[i]
  x1 <- tryCatch(
    {
      df1 <- getSymbols(ticker1, from = start, to = end, env = NULL, timeout = 30)
    },
    error = function(e) {
      print("error with ticker")
    }
  )
  # on a download error, tryCatch returns the short error message instead of a series
  if (length(x1) <= 2) {
    Sys.sleep(1)
    next
  }
  for (j in (i + 1):length(sp500$Tickers)) {
    cat(ticker1, sp500$Tickers[j], "\n")
    ticker2 <- sp500$Tickers[j]
    x2 <- tryCatch(
      {
        df2 <- getSymbols(ticker2, from = start, to = end, env = NULL, timeout = 30)
      },
      error = function(e) {
        print("error with ticker")
      }
    )
    if (length(x2) <= 2) {
      Sys.sleep(1)
      next
    }
    # set up a data frame of adjusted closes for both tickers
    prices <- na.omit(merge(df1[, 6], df2[, 6]))
    names(prices) <- c(ticker1, ticker2)
    # fit a linear regression of the first ticker on the second
    fit <- lm(prices[, 1] ~ prices[, 2])
    # our phi_1 coefficient (the hedge ratio)
    cc <- coefficients(fit)[[2]]
    # define the new process Z, our candidate cointegration relationship
    z <- as.numeric(prices[, 1] - cc * prices[, 2])
    # extract the Phillips-Ouliaris statistic and its 1 pct critical value
    coint.po <- ca.po(prices, demean = "constant")
    wow <- summary(coint.po)
    one_pct <- wow@cval[3]
    P <- wow@teststat
    value <- kpss.test(z)
    kpss <- value$p.value
    # check whether P is greater than the one percent critical value, and
    # whether the KPSS null hypothesis of stationarity is not rejected
    if ((P > one_pct) & (kpss >= 0.1)) {
      cat(P, one_pct, kpss, "\n")
      rank <- P - one_pct
      df[nrow(df) + 1, ] <- c(ticker1, ticker2, kpss, one_pct, P, rank)
    }
  }
}
write.csv(df, "sp500_project.csv", row.names = FALSE)
For this particular run we find the results below.
ticker1 | ticker2 | kpss | one_pct | P | rank |
---|---|---|---|---|---|
ABT | CARR | 0.1 | 48.0021 | 52.27846 | 4.276361 |
HD | LIN | 0.1 | 48.0021 | 75.24260 | 27.240505 |
HD | RMD | 0.1 | 48.0021 | 51.73153 | 3.729431 |
Here are three examples that passed the required tests. Since the data used to fit the models ended in October of 2022, we can plot how the processes played out afterward, through April of 2023.
library(quantmod)
library(forecast)
library(urca)
library(tseries)
library(BatchGetSymbols)
combinations <- data.frame(ticker1 = c("ABT", "HD", "HD"),
                           ticker2 = c("CARR", "LIN", "RMD"))
start <- "2016-01-03"
done_data <- "2022-10-01"
end <- "2023-10-01"
for (i in 1:nrow(combinations)) {
  ticker1 <- combinations[i, 1]
  ticker2 <- combinations[i, 2]
  # refit on the original sample to recover the hedge ratio and mean
  df1 <- getSymbols(ticker1, from = start, to = done_data, env = NULL, timeout = 30)
  df2 <- getSymbols(ticker2, from = start, to = done_data, env = NULL, timeout = 30)
  prices <- na.omit(merge(df1[, 6], df2[, 6]))
  names(prices) <- c(ticker1, ticker2)
  # fit a linear regression of the first ticker on the second
  fit <- lm(prices[, 1] ~ prices[, 2])
  original_cc <- coefficients(fit)[[2]]
  true_mu <- coefficients(fit)[[1]]
  # see how the process behaved going forward
  df1 <- getSymbols(ticker1, from = start, to = end, env = NULL, timeout = 30)
  df2 <- getSymbols(ticker2, from = start, to = end, env = NULL, timeout = 30)
  prices <- na.omit(merge(df1[, 6], df2[, 6]))
  z <- as.numeric(prices[, 1] - original_cc * prices[, 2])
  dates <- as.Date(index(prices))
  plot(dates, z, type = "l", ylab = "Z", xlab = "time", main = paste(ticker1, "/", ticker2))
  # the intercept from our linear regression model
  abline(h = true_mu, col = "red")
  # the end of the fitting window
  abline(v = as.numeric(as.Date("2022-10-01")), col = "red")
}
The horizontal red line represents the mean of Z (the regression intercept), and the vertical red line marks the end of the data used to fit the model (October 2022). Everything to the right of the vertical line shows how the pair behaved afterward under the original model. The date ranges differ because some ticker symbols did not exist for the full time frame of the experiment. Looking at the three plots, all made at least one attempt at reverting back to the mean. The second plot seems to drift away after its first mean reversion, while the first and last plots revert back to the mean at least once. This makes sense, as we would not expect arbitrage opportunities to stay consistent for long.