Paraphrasing investopedia.com: statistical arbitrage (or stat arb) is a group of trading strategies that use mean reversion analyses to invest in portfolios of many securities for very short periods of time. Well, what does this mean?

The key idea is mean reversion: the tendency of correlated assets to eventually return to their long-run average relationship. For example, say we believe that assets A and B are 100% correlated. If asset A were to increase in price by 5% in one day, we would expect asset B to do the same. If this didn't happen, say asset B only appreciated by 2%, then we might expect a price retracement in asset A. Given this assumption, we can make automated trading decisions when we observe such a dislocation.

Let's dive into a bit of code to bring some life to this idea. Note: the code in this notebook won't run as I have redacted some inputs to protect IP.

First we need to bring into scope some libraries that are super handy for analysing data.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

```
symbols = ['BTC-BUSD-PERP', 'ETH-BUSD-PERP', 'BNB-BUSD-PERP']
```

```
perp_df = data_reader.read(symbols=symbols,
                           start_datetime=pd.Timestamp("2022-1-1"),
                           end_datetime=pd.Timestamp("2022-11-7"),
                           freq='60s',
                           features='mid',
                           exchanges='BINANCE_LINEAR')
```
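Since `data_reader` is internal and redacted, here is a purely hypothetical stand-in that generates correlated random-walk mid prices with the same shape, in case you want to follow along. The function name, the factor structure and all the numbers are my own invention, not the real data source.

```python
import numpy as np
import pandas as pd

def make_synthetic_mids(symbols, start, end, freq='60s', seed=0):
    # Hypothetical stand-in for the redacted data_reader: correlated
    # random-walk mid prices, one column per symbol
    index = pd.date_range(start, end, freq=freq)
    rng = np.random.default_rng(seed)
    market = rng.normal(0, 1e-3, size=len(index))    # shared market factor
    frame = {}
    for i, symbol in enumerate(symbols):
        idio = rng.normal(0, 5e-4, size=len(index))  # per-asset noise
        frame[symbol] = 100.0 * (i + 1) * np.exp(np.cumsum(market + idio))
    return pd.DataFrame(frame, index=index)

perp_df = make_synthetic_mids(
    ['BTC-BUSD-PERP', 'ETH-BUSD-PERP', 'BNB-BUSD-PERP'],
    pd.Timestamp("2022-1-1"), pd.Timestamp("2022-11-7"))
```

The shared `market` factor is what makes the synthetic series correlated, mimicking the real data's behaviour below.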

We can take a look at the pandas dataframe which holds the price data

```
perp_df
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 00:01:00 | 46244.25 | 3683.645 | 512.405 |
| 2022-01-01 00:02:00 | 46313.75 | 3691.225 | 513.145 |
| 2022-01-01 00:03:00 | 46358.05 | 3692.275 | 513.545 |
| 2022-01-01 00:04:00 | 46328.95 | 3689.615 | 513.115 |
| 2022-01-01 00:05:00 | 46315.35 | 3689.385 | 512.885 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | 21270.95 | 1626.095 | 348.895 |
| 2022-11-05 23:57:00 | 21278.05 | 1626.895 | 349.135 |
| 2022-11-05 23:58:00 | 21280.75 | 1627.455 | 349.085 |
| 2022-11-05 23:59:00 | 21293.55 | 1626.655 | 349.145 |
| 2022-11-06 00:00:00 | 21293.55 | 1626.405 | 348.965 |

444960 rows × 3 columns

and we can plot the daily prices to make sure they're reasonable.

```
fig, axs = plt.subplots(1, len(symbols), figsize=(15, 6))
for i, symbol in enumerate(symbols):
    perp_df[symbol].resample('D').last().plot(ax=axs[i], grid=True, title=symbol)
```

Looking at these price graphs alone already catches our attention. Just eyeballing them, you can see that the price action is very similar across these three assets, telling us they're highly correlated and "should" move together.

Since they're on different scales (and for some other nice mathematical properties) it's much more natural to use the returns of these price series as our starting point for investigation.

```
returns_df = perp_df.pct_change().dropna()
```

We can better see this idea of correlation if we plot the scatter plot of say ETH vs BNB.

```
returns_df.plot.scatter(x='ETH-BUSD-PERP', y='BNB-BUSD-PERP', grid=True)
```

*[Scatter plot of ETH-BUSD-PERP returns vs BNB-BUSD-PERP returns]*

Visually we can see there is an obvious correlation between the two assets. When ETH returns are positive so are BNB returns and vice versa. We can even see there is a preferred direction along which they move. This preferred direction is called the first principal component.

Without getting too heavy into the mathematics, we can work out this principal component with a little bit of linear algebra.

Recall that the covariance matrix $\Sigma$ of a centered (zero-mean) returns matrix $\mathbf{X}$ is defined as

$$\Sigma = \mathbb{E}(\mathbf{X}^T \mathbf{X})$$

where $\mathbb{E}$ is the expectation operator.

Once we have this covariance matrix $\Sigma$ we can extract the components by taking its eigendecomposition

$$\Sigma = V \Lambda V^T$$

where $V$ is the matrix of eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues. The eigenvector associated with the largest eigenvalue is our first principal component. We can demonstrate this visually
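As a quick numerical sanity check (a toy example with made-up numbers, not part of the strategy), we can confirm that `np.linalg.eigh` gives back exactly this decomposition:

```python
import numpy as np

# Toy 2x2 covariance matrix, purely illustrative
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
eigvals, eigvecs = np.linalg.eigh(sigma)   # eigenvalues in ascending order
# Reconstruct sigma as V Lambda V^T
print(np.allclose(eigvecs @ np.diag(eigvals) @ eigvecs.T, sigma))   # True
# The last column is the eigenvector of the largest eigenvalue,
# i.e. the first principal component
first_pc = eigvecs[:, -1]
```

Note that `eigh` sorts eigenvalues in ascending order, so the first principal component sits in the *last* column of `eigvecs`.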

```
cov = returns_df[['ETH-BUSD-PERP', 'BNB-BUSD-PERP']].cov()
# eigh returns eigenvalues in ascending order, so the last column of
# eigvecs is the eigenvector of the largest eigenvalue, i.e. our
# first principal component
eigvals, eigvecs = np.linalg.eigh(cov)
returns_df.plot.scatter(x='ETH-BUSD-PERP', y='BNB-BUSD-PERP', grid=True)
plt.quiver(0, 0, eigvecs[0, -1], eigvecs[1, -1], scale=3)
```

*[Scatter plot of ETH vs BNB returns with the first principal component drawn as an arrow]*

So as we can see our principal component points nicely along the direction we would expect. In mathematical speak, we could say that this direction explains the largest amount of variance we see in the returns of ETH and BNB. In fact we could go as far as to say this direction is the line we *expect* returns to fall on, and if they don't there's been a deviation. So how can we use this information to make a trading strategy?

So we know how to extract our principal component, and we're also armed with the assumption that perhaps when asset returns deviate from this component they could revert, but how can we proceed? Well it's back to our trusted friend linear algebra.

We can project our returns and measure their deviation from this component. For a centered returns matrix $\mathbf{X}$ and a matrix of eigenvectors $V$ (here, every eigenvector *except* the first principal component) this is equivalent to computing

$$P = \mathbf{X} V V^T$$

which gives us something that looks like

```
eth_bnb = returns_df[['ETH-BUSD-PERP', 'BNB-BUSD-PERP']]
eth_bnb_avg = eth_bnb.mean(axis=0)
# Equivalent to the projection equation above, including the centering step
proj = (eth_bnb - eth_bnb_avg) @ eigvecs[:,:-1] @ eigvecs[:,:-1].T + eth_bnb_avg.values
```

```
proj.plot.scatter(x='ETH-BUSD-PERP', y='BNB-BUSD-PERP', grid=True)
```

*[Scatter plot of the projected returns, which fall along a line]*

How can we interpret what this line means? Well, for each pair of ETH and BNB returns in our data, this is a visualisation of "how far" the returns have deviated from the expectation (i.e. that principal component line). We can then introduce a simple rule that says, "I expect any deviation will revert exactly back to the expected value at the next interval". This simple rule will be our signal.

Ok, I'm going to switch back to using all three symbols of interest, rather than just ETH and BNB. What's great about mathematics is that it allows us to take these intuitive geometric notions and generalise them to as many dimensions as we want.

One thing we have to be very careful of in trading is a concept known as "forward bias" (also called look-ahead bias). This is equivalent to seeing the future, which unfortunately in a live trading system we cannot do.

So when we create our signal, we need to be careful not to use the future. We can do this by "rolling" our returns over a window of interest. We calculate the signal on this window and then use that signal at the *next* timestep.

Rolling our returns with a window of 2hrs (which produces a 3D array) looks like

```
window = 120 # 2hrs
rolled_returns = roll(returns_df, window)
```
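The `roll` helper above is one of the redacted pieces. A minimal sketch of what it might do, using NumPy's `sliding_window_view` (the function body here is my assumption, not the real implementation), is:

```python
import numpy as np
import pandas as pd

def roll(returns_df: pd.DataFrame, window: int) -> np.ndarray:
    # Hypothetical sketch of the redacted helper: stack overlapping
    # windows of rows into shape (num_windows, window, num_assets)
    values = returns_df.to_numpy()
    views = np.lib.stride_tricks.sliding_window_view(values, window, axis=0)
    # sliding_window_view appends the window axis last; move it to the middle
    return views.transpose(0, 2, 1)
```

Each successive window drops the oldest row and appends the newest one, which is exactly what we see in the two displays below.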

We can display the first window of 2hrs, denoted as $\text{step} = 0$

```
display_roll(step=0)
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 00:02:00 | 0.001503 | 0.002058 | 0.001444 |
| 2022-01-01 00:03:00 | 0.000957 | 0.000284 | 0.000780 |
| 2022-01-01 00:04:00 | -0.000628 | -0.000720 | -0.000837 |
| 2022-01-01 00:05:00 | -0.000294 | -0.000062 | -0.000448 |
| 2022-01-01 00:06:00 | 0.002500 | 0.002331 | 0.001638 |
| ... | ... | ... | ... |
| 2022-01-01 01:57:00 | 0.000312 | 0.000150 | 0.000193 |
| 2022-01-01 01:58:00 | -0.000590 | -0.000414 | -0.001410 |
| 2022-01-01 01:59:00 | 0.000470 | 0.000497 | 0.000793 |
| 2022-01-01 02:00:00 | -0.000186 | -0.000099 | -0.000290 |
| 2022-01-01 02:01:00 | 0.000569 | 0.000827 | 0.001450 |

120 rows × 3 columns

Showing $\text{step} = 1$, we can see we have moved forward in time by one minute

```
display_roll(step=1)
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 00:03:00 | 0.000957 | 0.000284 | 0.000780 |
| 2022-01-01 00:04:00 | -0.000628 | -0.000720 | -0.000837 |
| 2022-01-01 00:05:00 | -0.000294 | -0.000062 | -0.000448 |
| 2022-01-01 00:06:00 | 0.002500 | 0.002331 | 0.001638 |
| 2022-01-01 00:07:00 | 0.001697 | 0.001736 | 0.000487 |
| ... | ... | ... | ... |
| 2022-01-01 01:58:00 | -0.000590 | -0.000414 | -0.001410 |
| 2022-01-01 01:59:00 | 0.000470 | 0.000497 | 0.000793 |
| 2022-01-01 02:00:00 | -0.000186 | -0.000099 | -0.000290 |
| 2022-01-01 02:01:00 | 0.000569 | 0.000827 | 0.001450 |
| 2022-01-01 02:02:00 | -0.000167 | -0.000236 | -0.000154 |

120 rows × 3 columns

```
covs, mus = calc_multidimensional_covariance(rolled_returns)
eigvals, eigvecs = np.linalg.eigh(covs)
V = eigvecs[:, :, :-1]
raw_signal = (rolled_returns - mus) @ V @ V.transpose(0, 2, 1) + mus
```
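`calc_multidimensional_covariance` is also redacted. A plausible sketch (my assumption of what it does, based on how its outputs are used above) computes a per-window mean and sample covariance over the stacked windows:

```python
import numpy as np

def calc_multidimensional_covariance(rolled_returns: np.ndarray):
    # Hypothetical sketch: per-window mean and sample covariance for
    # an array of shape (num_windows, window, num_assets)
    window = rolled_returns.shape[1]
    mus = rolled_returns.mean(axis=1, keepdims=True)              # (N, 1, d)
    centered = rolled_returns - mus
    covs = centered.transpose(0, 2, 1) @ centered / (window - 1)  # (N, d, d)
    return covs, mus
```

Because `covs` is a stack of `(d, d)` matrices, the batched `np.linalg.eigh(covs)` call above decomposes every window's covariance in one shot.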

```
signal = np.clip(raw_signal / raw_signal.std(axis = 1, keepdims = True, ddof = 1), -6, 6)
```

```
signal_df = pd.DataFrame(signal[:, -1, :], columns = returns_df.columns, index = returns_df.index[window - 1:])
signal_df
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 02:01:00 | -1.345710 | -0.414521 | 2.824633 |
| 2022-01-01 02:02:00 | 0.545039 | 0.288037 | 0.342169 |
| 2022-01-01 02:03:00 | 0.177717 | 0.162890 | 0.720455 |
| 2022-01-01 02:04:00 | -0.041850 | 0.535374 | 0.670429 |
| 2022-01-01 02:05:00 | 0.891872 | 0.244282 | 0.223873 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | -0.436605 | 0.219841 | -0.420987 |
| 2022-11-05 23:57:00 | 0.736302 | 0.353268 | -0.774764 |
| 2022-11-05 23:58:00 | 0.613871 | 1.115094 | -1.356724 |
| 2022-11-05 23:59:00 | 3.152448 | -1.941574 | 0.438164 |
| 2022-11-06 00:00:00 | 0.569008 | 0.165273 | -0.634369 |

444840 rows × 3 columns

So a few things to note whilst looking at our signal dataframe.

Firstly, you'll notice the first timestamp in the frame corresponds to the final timestamp in our first roll window. This makes sense: it was only at this time (when the first 2hrs had passed) that we had enough data to calculate a signal. This is a good sanity check to ensure we haven't included any future information.

Secondly, you'll notice that our signal can be both positive and negative. This corresponds to taking either a long or a short position in each asset (and hence why we look at perpetual futures on centralised exchanges!).

Finally, you'll notice that the numbers are small, so we need to think about how we scale these up to be dollar amounts we wish to trade.

There are many ways to skin a stat arb cat, but I'm just going to demonstrate something simple here. Remember back when we built the intuition about deviations from the first principal component, and that these assets would *revert* back. This means our trades need to be in the opposite direction to our signal (which measures just the deviation). So we'll need to include a minus sign.

We also want to scale our raw signals into something we can make sense of. USD is generally what we're trying to gain, so let's scale the signals so that at every timestep we bet 3000 USD in total. That is to say, it doesn't matter if we're long or short, we only want our dollars at risk to be 3000 USD. Hence, we need to normalise by the sum of the absolute values of the signals.

```
dollars_at_risk = 3000
positions = -signal_df.multiply((dollars_at_risk / signal_df.abs().sum(1)),axis=0)
positions
```
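To convince ourselves the scaling does what we want, here's a toy check (the signal numbers are made up for illustration) that the gross dollar exposure of each row comes out at exactly 3000 USD:

```python
import pandas as pd

dollars_at_risk = 3000
# Toy signal frame, purely illustrative numbers
toy_signal = pd.DataFrame({'A': [1.0, -2.0], 'B': [0.5, 0.5], 'C': [-1.5, 1.0]})
toy_positions = -toy_signal.multiply(dollars_at_risk / toy_signal.abs().sum(1), axis=0)
# Gross dollars at risk per row is exactly dollars_at_risk
print(toy_positions.abs().sum(1).round(6).tolist())   # [3000.0, 3000.0]
```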

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 02:01:00 | 880.534455 | 271.232119 | -1848.233427 |
| 2022-01-01 02:02:00 | -1391.297166 | -735.261197 | -873.441637 |
| 2022-01-01 02:03:00 | -502.468153 | -460.548815 | -2036.983032 |
| 2022-01-01 02:04:00 | 100.628351 | -1287.314163 | -1612.057485 |
| 2022-01-01 02:05:00 | -1967.324465 | -538.847394 | -493.828142 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | 1215.682310 | -612.124143 | 1172.193547 |
| 2022-11-05 23:57:00 | -1184.822222 | -568.463161 | 1246.714618 |
| 2022-11-05 23:58:00 | -596.824256 | -1084.127807 | 1319.047937 |
| 2022-11-05 23:59:00 | -1709.513113 | 1052.878743 | -237.608144 |
| 2022-11-06 00:00:00 | -1247.231733 | -362.268146 | 1390.500121 |

444840 rows × 3 columns

```
trades = positions.diff().fillna(positions.iloc[0])
trades
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 02:01:00 | 880.534455 | 271.232119 | -1848.233427 |
| 2022-01-01 02:02:00 | -2271.831620 | -1006.493316 | 974.791789 |
| 2022-01-01 02:03:00 | 888.829012 | 274.712382 | -1163.541395 |
| 2022-01-01 02:04:00 | 603.096505 | -826.765348 | 424.925547 |
| 2022-01-01 02:05:00 | -2067.952816 | 748.466770 | 1118.229344 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | 2382.024364 | -80.684424 | -130.024681 |
| 2022-11-05 23:57:00 | -2400.504532 | 43.660982 | 74.521071 |
| 2022-11-05 23:58:00 | 587.997965 | -515.664647 | 72.333319 |
| 2022-11-05 23:59:00 | -1112.688856 | 2137.006550 | -1556.656081 |
| 2022-11-06 00:00:00 | 462.281379 | -1415.146889 | 1628.108266 |

444840 rows × 3 columns

The trades are important when handing over this sort of signal to a live execution engine, but for the purposes of research here we'll just use the positions.

So now that we have our positions we can actually simulate how those positions would have performed under some very serious (and unfortunately unrealistic) assumptions.

To value our positions we want to multiply them by the returns in the following minute. I have a preference for moving the returns back in time by one step, so that I can see what the gross PnL of a particular signal was; you might prefer to move your signal forward in time so that it aligns with when you would have realised the gross PnL. You say tomato, I say tomato.

```
(positions*returns_df.shift(-1)).cumsum().plot(grid=True)
```
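A tiny toy example (illustrative numbers only) of why the `shift(-1)` lines each position up with the return it earns:

```python
import pandas as pd

toy_positions = pd.Series([10.0, -10.0, 10.0])
toy_returns = pd.Series([0.01, -0.02, 0.01])
# The position held at time t earns the return realised over the
# next interval, hence shifting returns back by one step
toy_pnl = toy_positions * toy_returns.shift(-1)
print(toy_pnl.tolist())   # [-0.2, -0.1, nan]
```

The final PnL is NaN because the last position has no subsequent return to earn yet.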

*[Cumulative gross PnL per symbol]*

As we can see, the lines go from bottom left to top right, which is what every quant is looking for. Let's take a look at the portfolio as a whole and compare it with buying and holding 1000 USD each of BTC, ETH and BNB.

NB. For the quick-minded reader, you'll see I cumulatively summed percentage returns. Very naughty, you might say; I should have used log returns or cumulatively multiplied. I say bite me, I'm a maverick, or more accurately, that I am just betting 3000 USD at each timestep with no compounding, whichever answer you prefer.

```
strategy_pnl = (positions*returns_df.shift(-1)).sum(1)
hold_pnl = (1000 * returns_df).sum(1)
create_pnl_srs(strategy_pnl, 'Strat')
create_pnl_srs(hold_pnl, 'BnH')
```
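`create_pnl_srs` is another redacted helper. A hypothetical minimal version, which just plots and returns the cumulative gross PnL (the real implementation may do more), could be:

```python
import pandas as pd

def create_pnl_srs(pnl: pd.Series, label: str) -> pd.Series:
    # Hypothetical stand-in for the redacted helper: plot and return
    # the cumulative gross PnL of a per-interval PnL series
    cum_pnl = pnl.cumsum()
    cum_pnl.plot(grid=True, label=label, legend=True)
    return cum_pnl
```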

Ok, so we outperformed the market by quite a bit. In fact, on \$3000 we made

```
strat_gross_pnl = strategy_pnl.sum()
strat_returns = 100 * strategy_pnl.sum() / dollars_at_risk
strat_sr = np.sqrt(260) * strategy_pnl.resample('B').sum().mean() / strategy_pnl.resample('B').sum().std()
print(f"Gross Pnl: ${strat_gross_pnl:.2f}\nReturns: {strat_returns:.2f}%\nSharpe Ratio: {strat_sr:.2f}")
```

Gross Pnl: $12319.90
Returns: 410.66%
Sharpe Ratio: 9.73

When compared with buy and hold which did

```
bnh_gross_pnl = hold_pnl.sum()
bnh_returns = 100 * hold_pnl.sum() / dollars_at_risk
bnh_sr = np.sqrt(260) * hold_pnl.resample('B').sum().mean() / hold_pnl.resample('B').sum().std()
print(f"Gross Pnl: ${bnh_gross_pnl:.2f}\nReturns: {bnh_returns:.2f}%\nSharpe Ratio: {bnh_sr:.2f}")
```

Gross Pnl: $-1239.96
Returns: -41.33%
Sharpe Ratio: -0.69

Not bad at all! I hear you clamour. Why didn't we start trading yesterday? I hear you shout. Well, there's a catch, there's always a catch.

So we've demonstrated a process which you might follow as an easy introduction to statistical arbitrage, but unfortunately it's flawed for a multitude of reasons:

- Fees, slippage and spread. We didn't include any notion of costs in this strategy, and they are always of order-1 importance. You'll find, if you analyse the bps per trade, that the edge this strategy has is too small to beat them.
- Instant fill. We assumed that we got our entire position on instantaneously to capture the minute's whole return. This just isn't realistic, and we need to include some notion of probabilistic fill time.
- Out of sample testing. Just because we had excellent results this year, we don't know if this strategy will hold up over time. We could easily have overfit to the data we have, so we should have followed a proper train, validate, test process.
- Too simplistic. At the heart of it, this strategy is too simple. Thousands of others have tried it and will try it, and that means any alpha it had will long since have decayed away.
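On the fees point, a first pass is simply to haircut PnL by a few bps on the dollars traded each interval. The fee rate here is a pure assumption for illustration, and the trades frame is a toy, not the real one above:

```python
import pandas as pd

fee_rate = 0.0002   # assumed 2bps taker fee, purely illustrative
# Toy trades frame: dollar position changes per interval
toy_trades = pd.DataFrame({'A': [100.0, -50.0], 'B': [-30.0, 20.0]})
# Fees are charged on the absolute dollars traded each interval
fees = toy_trades.abs().sum(1) * fee_rate
print(fees.round(6).tolist())   # [0.026, 0.014]
```

Subtracting a series like this from the gross PnL gives a rough net PnL, and is often enough to show whether an edge this small survives costs at all.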

But never fear dear reader, we can build upon this start. We can start to include more sophisticated signals, with multiple time horizons and mean-variance optimisation thrown in to boot. We can layer in other pieces of data and increase the universe of symbols and locations we trade. There is plenty more work to do here.