GitHub Repo of project: https://github.com/RooFernando/CFRM_523_project

Note: This final project builds on the original paper, which was replicated.

Title: Constructing Cointegrated Cryptocurrency Portfolios for Statistical Arbitrage
Authors: Tim Leung, Hung Nguyen

Summary of strategy:

The original proposal objective was to identify prime stat-arb trading hours for Cryptocurrencies. This makes the inherent assumption that the cointegration strategy is profitable, and we were looking for hours of the day in which cointegrated relationships were abundant. Once we found a period which had the most cointegrated relationships, trading of the strategy would be done within that window of time. The assets being tested were originally BTC, ETH, LTC, BCH, and XRP. However, this changed to BTC, ETH, LTC, and SOL due to data quality issues on the minute scale. Just as in the original paper, spreads consisting of varying combinations of the available assets were traded.

Through the process this objective changed in a few ways. Originally the project was to be split into two parts. Part 1 was to find the two-hour window of time that had the most cointegrated relationships between assets. Part 2 was to backtest the cointegration strategy on the optimal times found in Part 1.

Working through the project, we found this methodology was flawed in a few ways. The largest flaw was the assumption that the cointegration trading strategy was profitable. Some spreads are profitable, but that profit quickly vanishes once a reasonable commission of 0.1% per transaction is added. We also found that splitting the project into two parts was redundant. Optimizing the strategy to maximize net profit, while adjusting the time of day to trade and the window of time to trade (how long to trade each day), essentially combines the two parts.

So going forward, the strategy being tested is the cointegrated relationship between the assets BTC, ETH, LTC, and SOL. The original signal, the spread crossing the upper or lower bound of the cointegrated relationship, is supplemented by a signal from a machine learning forecast. Once a boundary is crossed, we only enter the trade if a separate ML forecast of the spread is also in the direction of mean reversion. The traditional cointegration strategy is then compared against itself with the addition of the ML forecast signal.

## ['BTC-USD', 'ETH-USD']
## ['BTC-USD', 'LTC-USD']
## ['BTC-USD', 'SOL-USD']
## ['ETH-USD', 'LTC-USD']
## ['ETH-USD', 'SOL-USD']
## ['LTC-USD', 'SOL-USD']
## ['BTC-USD', 'ETH-USD', 'LTC-USD']
## ['BTC-USD', 'ETH-USD', 'SOL-USD']
## ['BTC-USD', 'LTC-USD', 'SOL-USD']
## ['ETH-USD', 'LTC-USD', 'SOL-USD']
## ['BTC-USD', 'ETH-USD', 'LTC-USD', 'SOL-USD']

The above are all combinations of the assets being analyzed. There are a total of 11 spreads. We hope to find a profitable spread.
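As a minimal sketch, the list of spreads above can be reproduced by taking every combination of two or more of the four assets. The names `ASSETS` and `all_spreads` are illustrative and not taken from the project code.

```python
from itertools import combinations

# Tickers as they appear in the list of spreads above.
ASSETS = ["BTC-USD", "ETH-USD", "LTC-USD", "SOL-USD"]


def all_spreads(assets):
    """Return every combination of two or more assets, smallest groups first."""
    return [
        list(combo)
        for size in range(2, len(assets) + 1)
        for combo in combinations(assets, size)
    ]


spreads = all_spreads(ASSETS)
for spread in spreads:
    print(spread)
print(len(spreads))  # 6 pairs + 4 triples + 1 quadruple
```

With four assets this gives 6 pairs, 4 triples, and 1 four-asset spread.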

Our indicators:

  • Cointegration: Verifying stationarity and establishing upper and lower bounds
    • Augmented Dickey Fuller (ADF)
    • Phillips-Perron (PP)
    • Kwiatkowski-Phillips-Schmidt-Shin (KPSS)
  • Machine Learning price forecast
    • Each asset will have a DNN trained to forecast the price t+60 (minutes) ahead.

Signal process and rules:

Cointegration:
Once a cointegrated relationship is identified, the buy and sell signals depend on the spread crossing the established lower or upper boundary. The boundary depends on sigma, which is one of the parameters to optimize, just as in the original paper.

Machine Learning:
Spreads will be calculated from each asset's DNN price forecast and used for sizing the trade. Example: if the spread crosses the lower boundary (we want to buy the spread) and our spread forecast for t+60 is in the direction of mean reversion, we buy 1 unit of the spread. The strategy is still cointegration, but we use ML for confidence in mean reversion.

Our constraints, benchmarks, and objectives of the strategy

Constraints:
- Stop loss of 10%.
- Trailing stop loss of 5%.
- Close out trades at a later defined period of time (only trade within the optimized time).
- Cointegration model is based on a history lookback window (time needed for formulation).
- Buying and selling of one unit of a defined spread.
- Entering and exiting a trade occurs:
  - Stopped out.
  - Reaching the other boundary. Example: buy 1 unit of the spread at the lower boundary, exit the trade once the spread reaches the upper boundary.

Objective:
- Maximize Net Profit.
- When optimizing the parameters (history lookback, trading window, and sigma values), all optimizations use the objective of maximizing Net Profit.

Benchmark:
- The original cointegration strategy will be used as the benchmark.
- The cointegration strategy + ML is the contender.

The Data

  • Strategy is based on minute data collected from 3/1/24 to 4/30/24 (2 months of data)
  • Assets include: BTC, ETH, LTC, SOL
  • All data obtained from Coinbase API
  • For the ML features matrix, RSI, and pct returns of the minute data are also calculated
##         date   ticker      open      high       low     close       volume
## 0 2024-03-01  BTC-USD  61179.03  61240.13  61162.83  61240.13    67.244532
## 1 2024-03-01  LTC-USD     79.99     80.23     79.99     80.23   473.582634
## 2 2024-03-01  SOL-USD    125.74    126.14    125.68    126.08  1108.448844
## 3 2024-03-01  ETH-USD   3341.78   3344.53   3340.15   3343.95   164.428233

Indicators, test indicators separately from the strategy

Indicator 1: Cointegration upper and lower thresholds
- Applying the two-step method per spread, we define an upper and a lower bound, each set a multiple of sigma away from the mean.
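The first step of the two-step method and the resulting bounds can be sketched for the simplest case of a two-asset spread. This is an illustration under stated assumptions, not the project's implementation: the function names and the toy prices are hypothetical, only ordinary least squares for a pair is shown, and the second step (testing the resulting spread for stationarity with ADF, PP, or KPSS) is left to a statistics library.

```python
from statistics import mean, pstdev


def fit_pair(y, x):
    """Step 1: ordinary least squares of y on x. Returns (beta, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return beta, my - beta * mx


def spread_bounds(y, x, beta, k):
    """Build the spread y - beta * x and return (lower, mean, upper) at k sigma."""
    spread = [yi - beta * xi for yi, xi in zip(y, x)]
    mu, sd = mean(spread), pstdev(spread)
    return mu - k * sd, mu, mu + k * sd


# Toy closing prices, made up for illustration only.
eth = [3340.0, 3342.0, 3345.0, 3343.0, 3347.0]
btc = [61180.0, 61210.0, 61250.0, 61230.0, 61290.0]

beta, intercept = fit_pair(btc, eth)
lower, mu, upper = spread_bounds(btc, eth, beta, k=3)
print(beta, lower, mu, upper)
```

In the strategy the inputs would be the closing prices in the history lookback window, and `k` would be the sigma parameter being optimized.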

Indicator 2: ML 60 minute forward forecast
- Once the spread has crossed either the upper or lower boundary of the formulated cointegrated relationship, the ML model is triggered to forecast 60 minutes ahead.
- The model is only run for predictions when cointegration boundaries are crossed.

Testing of these indicators separately is done in the final results.

Signal process, test signal process separately from the overall strategy

## ADF BTC : 0.0
## Phillips Perron BTC: 0.0
## KPSS BTC: 0.8014388295028906
## ADF ETH : 0.0
## Phillips Perron ETH: 0.0
## KPSS ETH: 0.7962983160499207
## ADF LTC : 9.216253748417339e-21
## Phillips Perron LTC: 0.0
## KPSS LTC: 0.07677680498885443
LinearRegression()
## coefficients:  [  1.         -17.07767596  46.483788  ]
## intercept:  7623.439888223496

As in the paper replication, we test the once-differenced asset prices for stationarity. The ADF and PP null hypotheses of non-stationarity are rejected for all 3 assets above, and the KPSS null hypothesis of stationarity is not rejected at the 5% level for any of them.

Our coefficients come out to be: [ 1.0, -17.07767596, 46.483788 ]

We see from the image above a cointegrated relationship between the assets BTC, ETH, and LTC over 12 hours.

Signal 1: Using Cointegration
Hypothetical Cointegrated Relationship from the calculations above: \(S_t = 1 BTC_t - 17 ETH_t + 46LTC_t\)

  • A buy signal is triggered when spread value crosses lower bound from cointegrated relationship.
    • We want to go Long the spread
      • +1 unit of \(S_t\) = Buy 1 BTC, Sell 17 ETH, Buy 46 LTC
  • A sell signal is triggered when the spread value crosses the upper bound from the cointegrated relationship.
    • We want to go Short the spread
      • -1 unit of \(S_t\) = Sell 1 BTC, Buy 17 ETH, Sell 46 LTC
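To make the size of one unit of this spread concrete, the sketch below builds the long and short legs from the rounded weights above and estimates the commission on a single entry. It assumes the 0.1% commission is charged on the absolute notional of every leg, and it uses the BTC, ETH, and LTC closes from the data sample shown earlier. The names `COEFFS`, `legs`, and `gross_notional` are illustrative.

```python
# Rounded weights of the hypothetical spread S_t = 1 BTC - 17 ETH + 46 LTC.
COEFFS = {"BTC": 1, "ETH": -17, "LTC": 46}

# Closing prices from the 2024-03-01 sample rows shown earlier in the report.
PRICES = {"BTC": 61240.13, "ETH": 3343.95, "LTC": 80.23}

COMMISSION = 0.001  # 0.1% per transaction (assumed to apply to each leg's notional)


def legs(direction):
    """Positions per asset for +1 (long the spread) or -1 (short the spread)."""
    return {asset: direction * weight for asset, weight in COEFFS.items()}


def gross_notional(positions, prices):
    """Sum of absolute dollar exposure across all legs."""
    return sum(abs(qty) * prices[asset] for asset, qty in positions.items())


long_legs = legs(+1)    # buy 1 BTC, sell 17 ETH, buy 46 LTC
short_legs = legs(-1)   # sell 1 BTC, buy 17 ETH, sell 46 LTC

notional = gross_notional(long_legs, PRICES)
entry_fee = COMMISSION * notional
print(long_legs, short_legs)
print(round(notional, 2), round(entry_fee, 2))
```

At these prices one unit of the spread carries roughly 121,778 USD of gross notional, so each entry or exit costs about 122 USD under the assumed fee model. That is the hurdle a round trip between the two bounds has to clear twice.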

Signal 2: Using Machine Learning (Disclaimer: Signal 2 is only triggered if Signal 1 is triggered)
We follow the same hypothetical cointegrated relationship from above.

  • The objective is to forecast the value of the spread 60 minutes ahead. We hope this serves as a reassuring signal that the spread is headed in the direction of reversion.
  • A Sequential Neural Network has been trained on each specific cryptocurrency (BTC, ETH, LTC, SOL)
  • Each asset is forecasted individually
  • The Spread value 60 minutes ahead is calculated using individual asset forecasts 60 minutes ahead.
  • Example Forecast: \(S_{t+60} = 1 BTC_{t+60} - 17 ETH_{t+60} + 46 LTC_{t+60}\)
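The forecasted spread is the same linear combination applied to the per-asset forecasts instead of the current prices. A minimal sketch, with made-up prices standing in for the current closes and the DNN outputs (the helper name `spread_value` is illustrative):

```python
# Rounded weights of the hypothetical spread from the example above.
COEFFS = {"BTC": 1, "ETH": -17, "LTC": 46}


def spread_value(prices, coeffs):
    """Weighted sum of asset prices using the cointegration weights."""
    return sum(weight * prices[asset] for asset, weight in coeffs.items())


# Made-up numbers for illustration; in the strategy these would be the
# current closes and each asset's 60-minute-ahead forecast.
current = {"BTC": 61000.0, "ETH": 3300.0, "LTC": 80.0}
forecast = {"BTC": 61100.0, "ETH": 3302.0, "LTC": 80.5}

s_now = spread_value(current, COEFFS)
s_forecast = spread_value(forecast, COEFFS)
print(s_now, s_forecast, s_forecast - s_now)
```

The sign of `s_forecast - s_now` is what tells us whether the forecast points up or down relative to the current spread.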

Describing the rule process, test rules incrementally

The rules of the strategy are similar to the original paper replication.

  1. We enter once an upper or lower boundary is crossed.
  2. We exit once the opposite upper or lower boundary is crossed.
  3. What changes is that we now only enter once the ML forecast confirms that the spread is moving in the direction of mean reversion 60 minutes ahead.

We test the rules incrementally by running two analyses.

  1. Traditional trading of the Cointegration strategy.
  2. The above strategy with the addition of the ML signal.
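The two analyses differ only in the entry rule, which can be sketched as one function with an optional forecast. Two assumptions are made here: a boundary counts as crossed when the current spread is beyond it, and the forecast confirms mean reversion when the forecasted spread is above the current spread for a long entry or below it for a short entry. The function names are illustrative.

```python
def entry_signal(spread, lower, upper, forecast=None):
    """Return +1 (long the spread), -1 (short the spread), or 0 (no entry).

    With forecast=None this is the traditional cointegration rule. When a
    forecasted spread value is given, the entry is taken only if the forecast
    moves back toward the mean (assumed confirmation rule).
    """
    if spread < lower:
        side = 1
    elif spread > upper:
        side = -1
    else:
        return 0
    if forecast is None:
        return side
    confirmed = (forecast - spread) * side > 0
    return side if confirmed else 0


def exit_signal(side, spread, lower, upper):
    """Exit a long at the upper bound and a short at the lower bound."""
    return (side == 1 and spread >= upper) or (side == -1 and spread <= lower)


# Bounds of 97 and 103 around a mean of 100, chosen for illustration.
print(entry_signal(95, 97, 103))                # traditional rule: long
print(entry_signal(95, 97, 103, forecast=94))   # forecast keeps falling: skip
print(entry_signal(95, 97, 103, forecast=99))   # forecast reverts: long
print(entry_signal(105, 97, 103, forecast=101)) # forecast reverts: short
```

Stop loss and trailing stop handling are left out of this sketch.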

Assess optimization of parameters

Parameters available for optimization:

  • History lookback window (formulation):
    • This is the number of bars to look back for each asset when applying the two-step method.
    • Example: a value of 4 means that BTC, ETH, LTC, and SOL each use their last 240 closing prices (4 hours * 60 minutes).
    • This is an essential parameter because it defines the cointegrated relationship, and thus the entry and exit points for trading.
  • Window of trading:
    • This value is the period of time in which the strategy is allowed to trade (starting at 00:00:00 UTC).
    • Example: a value of 8 would mean the strategy can trade from 00:00:00 to 08:00:00.
  • Sigma value:
    • This sets how far the upper and lower boundaries are from the mean of the established cointegrated spread.
    • Example: a value of 3 gives an upper boundary of \(\mu + 3\sigma\), where \(\mu\) is the mean of the cointegrated spread, and a lower boundary of \(\mu - 3\sigma\).
    • This parameter is essential, especially when dealing with commissions. A smaller \(\sigma\) might capture less profit per trade, which can easily vanish once commissions are introduced.
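The three parameters translate into concrete quantities as in the sketch below. The function names are illustrative, and the end of the trading window is treated as exclusive, which is an assumption.

```python
from datetime import datetime


def lookback_bars(hours):
    """Number of minute bars in the history lookback window."""
    return hours * 60


def in_trading_window(ts, window_hours):
    """True if the UTC timestamp falls in the first window_hours of the day.

    The window starts at 00:00:00 UTC; the end is exclusive (assumption).
    """
    return ts.hour * 60 + ts.minute < window_hours * 60


def bounds(mu, sigma, k):
    """Lower and upper boundary at k standard deviations from the mean."""
    return mu - k * sigma, mu + k * sigma


print(lookback_bars(4))
print(in_trading_window(datetime(2024, 4, 1, 7, 59), 8))
print(in_trading_window(datetime(2024, 4, 1, 8, 0), 8))
print(bounds(100, 2, 3))
```

Since a trade entered at one boundary exits at the other, the gross move captured per unit of spread is about \(2k\sigma\) for a sigma value of \(k\), which is why a small sigma value leaves little room for commissions.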

Apply walk forward analysis, discuss choice of objective function and impact on parameter choice

Walk forward analysis was implemented to optimize the \(\sigma\) value. The objective function was Total Net Profit. The purpose of applying walk forward analysis is to minimize overfitting of the parameters being optimized.

Implementation: Rolling window

\(\sigma\) = [1,1.5,2,2.5,3,3.5,4,4.5,5]

  • These values were all backtested in the training window.
  • The “winner” (\(\sigma\) value which maximized the objective function) will be used as the parameter \(\sigma\) on the test set. This continues until the full time length is finished.

Assess Overfitting

There is opportunity for overfitting on the training set, especially with regard to the history lookback and window of trading parameters. This is because the entirety of the March data was used when backtesting each spread once per parameter value. There is less opportunity for overfitting the sigma parameter, as it was obtained by a robust walk forward analysis.

ML models may be prone to overfitting as well. Though the training and validation results in March seem solid, performance is likely to vary on the full test in April.

Extend the analysis with other asset classes, additional similar techniques, or more sophisticated models

This process can be implemented on essentially any asset class. Pairs trading is quite common; we know this because most of the cohort's presentations were on pairs trading.

Adding sophistication:

  • Sizing based on ML forecast.
    • If the forecast for each asset is accurate, we may want to add size to the trade in the future. Maybe measure the slope of the forecasted spread value and score it appropriately.
  • Use ML forecast for volatility.
    • Dynamically adjust the stop/loss or trailing stop/loss based on the volatility forecast.
  • Applying this same process to the same assets using the copula and distance methods would be interesting.

General process of the project

Assets: BTC, ETH, LTC, SOL

  1. We trade the cointegration strategy
    • Standards for optimization:
      • Train: March (3/1/24 - 3/30/24).
      • Trade only from 04:00:00 to 08:00:00 UTC.
      • All trades closed out after 08:00:00 UTC.
      • Stop/Loss = 0.15.
      • Trailing Stop/Loss = 0.05.
      • \(\sigma\) = 3.
  2. Optimize for parameters on the standard above
    • History lookback (1,2,3,…24).
    • Window of trading (1,2,3,…24).
    • \(\sigma\) = [1,1.5,2,2.5,3,3.5,4,4.5,5].
  3. Train the ML
    • Full March data.
    • 60 features per Close price, RSI, minute return for each asset.
    • Our model will forecast 60 minutes ahead.
  4. Testing on April
    • Use optimized parameters from March.
    • Test on Cointegrated strategy only.
      • Get results for cointegrated strategy.
    • Test on Cointegrated strategy + ML added.
      • Get results for cointegrated + ML strat.
  5. Conclusion
    • Which is better?
    • Does ML help?

Optimizing parameters in the month of March

History Lookback and Trading Window:
- History lookback is the amount of bars used to formulate our cointegrated relationship.
- Trading Window is the window of time trading is permitted.

Disclaimer:
These two parameters were optimized using the training period of March entirely. Optimizing Sigma uses the walk forward analysis.

Optimizing History Lookback

We choose a history lookback of 21 hours, as 3 spreads are profitable there and the non-profitable spreads appear to be at a local minimum at 21 hours.

Optimizing Time Window for trading

Looking at the results, we see that a good window of trading parameter can be identified at 18. Though this value doesn’t yield the highest Net Profit, it does show two spreads to be profitable. Most of the spreads are not profitable, but at a trading window of 18 hours the other spreads lose less money.

Optimizing Sigma Walk Forward Analysis

A description of the process

Walk Forward analysis was applied

  • Time Frame: 3/1/24 to 3/31/24
  • Training and Testing increments: 5 days each
  • Applied to all 11 spreads
  • \(\sigma\) values: [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
  • Objective: maximize Total Net Profit

Summary: The walk forward method was utilized to find optimal parameters for sigma and reduce the likelihood of overfitting. A rolling window of 5 days was moved through the entire month of March.

Example of the process of the Walk Forward method for spread 1:

  1. Train: 3/1/24-3/6/24
    • Run separate backtests for all values of sigma [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
  2. A winner is chosen (and documented) from the above sigma values, based on maximum “Total Net Profit”
  3. That sigma value is then used to backtest over the testing period: 3/6/24-3/11/24
  4. That “Total Net Profit” is documented.
  5. At the end, an analysis can be done to find the ideal sigma value which occurs most frequently through the whole walk forward analysis.
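The steps above can be sketched as a small loop. This is an outline under assumptions, not the project's code: `backtest` stands for a hypothetical callable that runs the strategy for one sigma value over a date range and returns `(total_net_profit, number_of_trades)`, the windows are assumed to advance by 5 days at a time, and ties in profit are broken in favor of the larger sigma.

```python
from collections import Counter
from datetime import date, timedelta

SIGMAS = [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]


def rolling_windows(start, end, step_days=5):
    """Yield (train_start, split, test_end); the test window follows training."""
    step = timedelta(days=step_days)
    t = start
    while t + 2 * step <= end:
        yield t, t + step, t + 2 * step
        t += step


def walk_forward(backtest, start, end, sigmas=SIGMAS):
    """Pick the best sigma on each training window and score it on the test window.

    backtest(sigma, start, end) is a hypothetical callable returning
    (total_net_profit, number_of_trades).
    """
    rows = []
    for train_start, split, test_end in rolling_windows(start, end):
        scored = []
        for sigma in sigmas:
            profit, n_trades = backtest(sigma, train_start, split)
            if n_trades > 0:  # training runs with no trades are discarded
                scored.append((profit, sigma))
        if not scored:
            continue
        _, winner = max(scored)
        test_profit, _ = backtest(winner, split, test_end)
        rows.append((split, test_end, winner, test_profit))
    return rows


def most_frequent_sigma(rows):
    """Sigma value that wins the most training windows."""
    return Counter(row[2] for row in rows).most_common(1)[0][0]


# The first window reproduces the example: train 3/1-3/6, test 3/6-3/11.
for window in rolling_windows(date(2024, 3, 1), date(2024, 3, 31)):
    print(window)
```

Each row records the test window, the winning sigma, and its out-of-sample Total Net Profit, so the final choice can be made from how often each sigma value wins.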

Disclaimer: Some sigma values in the table above show a “Total Net Profit” of 0 on the testing period (meaning no trades). This is not the case in the training period, as backtests which resulted in 0 trades for a sigma value were discarded.

Example of how it looks underneath

Picking our Sigma

Through analysis we conclude that a sigma value of 3.5 is ideal. Looking at the full data, there isn’t much profitability to begin with. The most frequently occurring sigma values that come out of the training subsets are 3.5 and 5. However, on closer examination of those values we see that a sigma of 5 results in no trades, with a Total Net Profit of $0. So an optimal sigma value of 3.5 is chosen.

Results of separately backtesting our two signals

  1. Cointegration
  2. Machine Learning Forecast

Cointegration only

We use the optimized parameters obtained from the previous section. Backtesting was done for both \(\sigma\) values of 3.5 and 5, with the lookback window being 21 hours, and window of trading per day being 18 hours (from midnight to 6 pm UTC).

First we use cointegration with no ML to see how our strategy played out in the month of April.

Backtest Results: optimized parameters and no machine learning

Trying the first Sigma for the month of April:

Lookback Window: 21
Window of Trading: 18
Sigma value: 3.5

Trying the second best Sigma for the month of April:

Lookback Window: 21
Window of Trading: 18
Sigma value: 5

We see that the results were almost all unprofitable for either case of sigma (3.5, 5).
Drawdown does generally decrease for a sigma value of 5. This can be the result of a decrease in trading as signals are less likely to be triggered with such a high sigma value.

Cointegration With Machine Learning (same period)

First Sigma for the month of April with ML:

Lookback Window: 21
Window of Trading: 18
Sigma value: 3.5

Second Sigma for the month of April with ML:

Lookback Window: 21
Window of Trading: 18
Sigma value: 5

Conclusion

Comparing the results, with and without ML

Sigma 3.5

Unfortunately nearly all of the spreads remain unprofitable, with the pair “LTC-USD, SOL-USD” slightly profiting over $0. However, 9 out of 11 spreads improved “Total Net Profit” and “Drawdown” with the addition of the ML forecasting. The Spreads which didn’t improve were “ETH-USD, SOL-USD” and “ETH-USD, LTC-USD, SOL-USD”.

Sigma 5

Again, nearly all of the spreads remain unprofitable as above. However, 10 out of 11 spreads improved in “Total Net Profit”, while 9 out of 11 improved in “Drawdown” with 1 spread remaining the same.

When optimizing for parameters in March there were 3 spreads that were clearly profitable. Those 3 spreads were not profitable whatsoever in April.

In conclusion, we see that adding a machine learning signal does in fact help increase profitability (even though the spreads are not profitable). When conducting the original backtests, commissions were set to 0.1%; this value was chosen because several exchanges online charge this fee. However, towards the end of this project a classmate, Jiachen, explained to me that Robinhood does not charge commission for trading crypto. With that knowledge, some of these spreads may be profitable, as they were hovering just near profitability.