Paraphrasing investopedia.com: statistical arbitrage (or stat arb) is a group of trading strategies that use mean reversion analyses to invest in portfolios of many securities for very short periods of time. Well, what does this mean?

The key idea is mean reversion: the tendency of correlated assets to eventually return to their long-run average relationship. For example, say we believe that assets A and B are 100% correlated. If asset A were to increase in price by 5% in one day, we would expect asset B to do the same. If this didn't happen, say asset B only appreciated by 2%, then we might expect a price retracement in asset A. Given this assumption, we can make automated trading decisions when we observe such a dislocation.

Let's dive into a bit of code to bring some life to this idea. Note: the code in this notebook won't run as I have redacted some inputs to protect IP.

First we need to bring into scope some libraries that are super handy for analysing data.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

```
symbols = ['BTC-BUSD-PERP', 'ETH-BUSD-PERP', 'BNB-BUSD-PERP']
```

```
perp_df = data_reader.read(symbols=symbols,
                           start_datetime=pd.Timestamp("2022-1-1"),
                           end_datetime=pd.Timestamp("2022-11-7"),
                           freq='60s',
                           features='mid',
                           exchanges='BINANCE_LINEAR')
```
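Since `data_reader` is internal and redacted, here is a purely hypothetical stand-in that generates correlated random-walk mid prices with the same shape, in case you want to follow along. The function name, the factor structure and all the numbers are my own invention, not the real data source.

```python
import numpy as np
import pandas as pd

def make_synthetic_mids(symbols, start, end, freq='60s', seed=0):
    # Hypothetical stand-in for the redacted data_reader: correlated
    # random-walk mid prices, one column per symbol
    index = pd.date_range(start, end, freq=freq)
    rng = np.random.default_rng(seed)
    market = rng.normal(0, 1e-3, size=len(index))    # shared market factor
    frame = {}
    for i, symbol in enumerate(symbols):
        idio = rng.normal(0, 5e-4, size=len(index))  # per-asset noise
        frame[symbol] = 100.0 * (i + 1) * np.exp(np.cumsum(market + idio))
    return pd.DataFrame(frame, index=index)

perp_df = make_synthetic_mids(
    ['BTC-BUSD-PERP', 'ETH-BUSD-PERP', 'BNB-BUSD-PERP'],
    pd.Timestamp("2022-1-1"), pd.Timestamp("2022-11-7"))
```

The shared `market` factor is what makes the synthetic series correlated, mimicking the real data's behaviour below.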

We can take a look at the pandas dataframe which holds the price data

```
perp_df
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 00:01:00 | 46244.25 | 3683.645 | 512.405 |
| 2022-01-01 00:02:00 | 46313.75 | 3691.225 | 513.145 |
| 2022-01-01 00:03:00 | 46358.05 | 3692.275 | 513.545 |
| 2022-01-01 00:04:00 | 46328.95 | 3689.615 | 513.115 |
| 2022-01-01 00:05:00 | 46315.35 | 3689.385 | 512.885 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | 21270.95 | 1626.095 | 348.895 |
| 2022-11-05 23:57:00 | 21278.05 | 1626.895 | 349.135 |
| 2022-11-05 23:58:00 | 21280.75 | 1627.455 | 349.085 |
| 2022-11-05 23:59:00 | 21293.55 | 1626.655 | 349.145 |
| 2022-11-06 00:00:00 | 21293.55 | 1626.405 | 348.965 |

444960 rows × 3 columns

and we can plot the daily prices to make sure they're reasonable.

```
fig, axs = plt.subplots(1, len(symbols), figsize=(15, 6))
for i, symbol in enumerate(symbols):
    perp_df[symbol].resample('D').last().plot(ax=axs[i], grid=True, title=symbol)
```

Looking at these price graphs alone already catches our attention. Just eyeballing them, you can see that the price action is very similar across these three assets, telling us they're highly correlated and "should" move together.

Since they're on different scales (and for some other nice mathematical properties) it's much more natural to use the returns of these price series as our starting point for investigation.

```
returns_df = perp_df.pct_change().dropna()
```

We can better see this idea of correlation if we plot the scatter plot of say ETH vs BNB.

```
returns_df.plot.scatter(x='ETH-BUSD-PERP', y='BNB-BUSD-PERP', grid=True)
```

*[Scatter plot of ETH-BUSD-PERP returns vs BNB-BUSD-PERP returns]*

Visually we can see there is an obvious correlation between the two assets. When ETH returns are positive so are BNB returns and vice versa. We can even see there is a preferred direction along which they move. This preferred direction is called the first principal component.

Without getting too heavy into the mathematics, we can work out this principal component with a little bit of linear algebra.

Recall that the covariance matrix $\Sigma$ of a centered (zero-mean) returns matrix $\mathbf{X}$ is defined as

$$\Sigma = \mathbb{E}(\mathbf{X}^T \mathbf{X})$$

where $\mathbb{E}$ is the expectation operator.

Once we have this covariance matrix $\Sigma$ we can extract the components by taking its eigendecomposition

$$\Sigma = V \Lambda V^T$$

where $V$ is the matrix of eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues. The eigenvector associated with the largest eigenvalue is our first principal component. We can demonstrate this visually
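As a quick numerical sanity check (a toy example with made-up numbers, not part of the strategy), we can confirm that `np.linalg.eigh` gives back exactly this decomposition:

```python
import numpy as np

# Toy 2x2 covariance matrix, purely illustrative
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
eigvals, eigvecs = np.linalg.eigh(sigma)   # eigenvalues in ascending order
# Reconstruct sigma as V Lambda V^T
print(np.allclose(eigvecs @ np.diag(eigvals) @ eigvecs.T, sigma))   # True
# The last column is the eigenvector of the largest eigenvalue,
# i.e. the first principal component
first_pc = eigvecs[:, -1]
```

Note that `eigh` sorts eigenvalues in ascending order, so the first principal component sits in the *last* column of `eigvecs`.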

```
cov = returns_df[['ETH-BUSD-PERP', 'BNB-BUSD-PERP']].cov()
# eigh returns eigenvalues in ascending order, so the last column of
# eigvecs is the eigenvector of the largest eigenvalue, i.e. our
# first principal component
eigvals, eigvecs = np.linalg.eigh(cov)
returns_df.plot.scatter(x='ETH-BUSD-PERP', y='BNB-BUSD-PERP', grid=True)
plt.quiver(0, 0, eigvecs[0, -1], eigvecs[1, -1], scale=3)
```

*[Scatter plot of ETH vs BNB returns with the first principal component drawn as an arrow]*

So as we can see our principal component points nicely along the direction we would expect. In mathematical speak, we could say that this direction explains the largest amount of variance we see in the returns of ETH and BNB. In fact we could go as far as to say this direction is the line we *expect* returns to fall on, and if they don't there's been a deviation. So how can we use this information to make a trading strategy?

So we know how to extract our principal component, and we're also armed with the assumption that perhaps when asset returns deviate from this component they could revert, but how can we proceed? Well it's back to our trusted friend linear algebra.

We can project our returns and measure their deviation from this component. For a centered returns matrix $\mathbf{X}$ and a matrix of eigenvectors $V$ (here, every eigenvector *except* the first principal component) this is equivalent to computing

$$P = \mathbf{X} V V^T$$

which gives us something that looks like

```
eth_bnb = returns_df[['ETH-BUSD-PERP', 'BNB-BUSD-PERP']]
eth_bnb_avg = eth_bnb.mean(axis=0)
# Equivalent to the projection equation above, including the centering step
proj = (eth_bnb - eth_bnb_avg) @ eigvecs[:,:-1] @ eigvecs[:,:-1].T + eth_bnb_avg.values
```

```
proj.plot.scatter(x='ETH-BUSD-PERP', y='BNB-BUSD-PERP', grid=True)
```

*[Scatter plot of the projected returns, which fall along a line]*

How can we interpret what this line means? Well, for each pair of ETH and BNB returns in our data, this is a visualisation of "how far" the returns have deviated from the expectation (i.e. that principal component line). We can then introduce a simple rule that says, "I expect any deviation will revert exactly back to the expected value at the next interval". This simple rule will be our signal.

Ok, I'm going to switch back to using all three symbols of interest, rather than just ETH and BNB. What's great about mathematics is that it allows us to take these intuitive geometric notions and generalise them to as many dimensions as we want.

One thing we have to be very careful of in trading is a concept known as "forward bias" (also called look-ahead bias). This is equivalent to seeing the future, which unfortunately in a live trading system we cannot do.

So when we create our signal, we need to be careful not to use the future. We can do this by "rolling" our returns over a window of interest. We calculate the signal on this window and then use that signal at the *next* timestep.

Rolling our returns with a window of 2hrs (which produces a 3D array) looks like

```
window = 120 # 2hrs
rolled_returns = roll(returns_df, window)
```
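The `roll` helper above is one of the redacted pieces. A minimal sketch of what it might do, using NumPy's `sliding_window_view` (the function body here is my assumption, not the real implementation), is:

```python
import numpy as np
import pandas as pd

def roll(returns_df: pd.DataFrame, window: int) -> np.ndarray:
    # Hypothetical sketch of the redacted helper: stack overlapping
    # windows of rows into shape (num_windows, window, num_assets)
    values = returns_df.to_numpy()
    views = np.lib.stride_tricks.sliding_window_view(values, window, axis=0)
    # sliding_window_view appends the window axis last; move it to the middle
    return views.transpose(0, 2, 1)
```

Each successive window drops the oldest row and appends the newest one, which is exactly what we see in the two displays below.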

We can display the first window of 2hrs, denoted as $\text{step} = 0$

```
display_roll(step=0)
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 00:02:00 | 0.001503 | 0.002058 | 0.001444 |
| 2022-01-01 00:03:00 | 0.000957 | 0.000284 | 0.000780 |
| 2022-01-01 00:04:00 | -0.000628 | -0.000720 | -0.000837 |
| 2022-01-01 00:05:00 | -0.000294 | -0.000062 | -0.000448 |
| 2022-01-01 00:06:00 | 0.002500 | 0.002331 | 0.001638 |
| ... | ... | ... | ... |
| 2022-01-01 01:57:00 | 0.000312 | 0.000150 | 0.000193 |
| 2022-01-01 01:58:00 | -0.000590 | -0.000414 | -0.001410 |
| 2022-01-01 01:59:00 | 0.000470 | 0.000497 | 0.000793 |
| 2022-01-01 02:00:00 | -0.000186 | -0.000099 | -0.000290 |
| 2022-01-01 02:01:00 | 0.000569 | 0.000827 | 0.001450 |

120 rows × 3 columns

Showing $\text{step} = 1$, we can see we have moved forward in time by one minute

```
display_roll(step=1)
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 00:03:00 | 0.000957 | 0.000284 | 0.000780 |
| 2022-01-01 00:04:00 | -0.000628 | -0.000720 | -0.000837 |
| 2022-01-01 00:05:00 | -0.000294 | -0.000062 | -0.000448 |
| 2022-01-01 00:06:00 | 0.002500 | 0.002331 | 0.001638 |
| 2022-01-01 00:07:00 | 0.001697 | 0.001736 | 0.000487 |
| ... | ... | ... | ... |
| 2022-01-01 01:58:00 | -0.000590 | -0.000414 | -0.001410 |
| 2022-01-01 01:59:00 | 0.000470 | 0.000497 | 0.000793 |
| 2022-01-01 02:00:00 | -0.000186 | -0.000099 | -0.000290 |
| 2022-01-01 02:01:00 | 0.000569 | 0.000827 | 0.001450 |
| 2022-01-01 02:02:00 | -0.000167 | -0.000236 | -0.000154 |

120 rows × 3 columns

```
covs, mus = calc_multidimensional_covariance(rolled_returns)
eigvals, eigvecs = np.linalg.eigh(covs)
V = eigvecs[:, :, :-1]
raw_signal = (rolled_returns - mus) @ V @ V.transpose(0, 2, 1) + mus
```
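`calc_multidimensional_covariance` is also redacted. A plausible sketch (my assumption of what it does, based on how its outputs are used above) computes a per-window mean and sample covariance over the stacked windows:

```python
import numpy as np

def calc_multidimensional_covariance(rolled_returns: np.ndarray):
    # Hypothetical sketch: per-window mean and sample covariance for
    # an array of shape (num_windows, window, num_assets)
    window = rolled_returns.shape[1]
    mus = rolled_returns.mean(axis=1, keepdims=True)              # (N, 1, d)
    centered = rolled_returns - mus
    covs = centered.transpose(0, 2, 1) @ centered / (window - 1)  # (N, d, d)
    return covs, mus
```

Because `covs` is a stack of `(d, d)` matrices, the batched `np.linalg.eigh(covs)` call above decomposes every window's covariance in one shot.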

```
signal = np.clip(raw_signal / raw_signal.std(axis = 1, keepdims = True, ddof = 1), -6, 6)
```

```
signal_df = pd.DataFrame(signal[:, -1, :], columns = returns_df.columns, index = returns_df.index[window - 1:])
signal_df
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 02:01:00 | -1.345710 | -0.414521 | 2.824633 |
| 2022-01-01 02:02:00 | 0.545039 | 0.288037 | 0.342169 |
| 2022-01-01 02:03:00 | 0.177717 | 0.162890 | 0.720455 |
| 2022-01-01 02:04:00 | -0.041850 | 0.535374 | 0.670429 |
| 2022-01-01 02:05:00 | 0.891872 | 0.244282 | 0.223873 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | -0.436605 | 0.219841 | -0.420987 |
| 2022-11-05 23:57:00 | 0.736302 | 0.353268 | -0.774764 |
| 2022-11-05 23:58:00 | 0.613871 | 1.115094 | -1.356724 |
| 2022-11-05 23:59:00 | 3.152448 | -1.941574 | 0.438164 |
| 2022-11-06 00:00:00 | 0.569008 | 0.165273 | -0.634369 |

444840 rows × 3 columns

So a few things to note whilst looking at our signal dataframe.

Firstly, you'll notice the first timestamp in the frame corresponds to the final timestamp in our first roll window. This makes sense: it was only at this time (when the first 2hrs had passed) that we had enough data to calculate a signal. This is a good sanity check to ensure we haven't included any future information.

Secondly, you'll notice that our signal can be both positive and negative. This corresponds to taking either a long or a short position in each asset (and hence why we look at perpetual futures on centralised exchanges!).

Finally, you'll notice that the numbers are small, so we need to think about how we scale these up to be dollar amounts we wish to trade.

There are many ways to skin a stat arb cat, but I'm just going to demonstrate something simple here. Remember back when we built the intuition about deviations from the first principal component, and that these assets would *revert* back. This means our trades need to be in the opposite direction to our signal (which measures just the deviation). So we'll need to include a minus sign.

We also want to scale our raw signals into something we can make sense of. USD is generally what we're trying to gain, so let's scale the signals so that at every timestep we bet 3000 USD in total. That is to say, it doesn't matter if we're long or short, we only want our dollars at risk to be 3000 USD. Hence, we need to normalise by the sum of the absolute values of the signals.

```
dollars_at_risk = 3000
positions = -signal_df.multiply((dollars_at_risk / signal_df.abs().sum(1)),axis=0)
positions
```
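To convince ourselves the scaling does what we want, here's a toy check (the signal numbers are made up for illustration) that the gross dollar exposure of each row comes out at exactly 3000 USD:

```python
import pandas as pd

dollars_at_risk = 3000
# Toy signal frame, purely illustrative numbers
toy_signal = pd.DataFrame({'A': [1.0, -2.0], 'B': [0.5, 0.5], 'C': [-1.5, 1.0]})
toy_positions = -toy_signal.multiply(dollars_at_risk / toy_signal.abs().sum(1), axis=0)
# Gross dollars at risk per row is exactly dollars_at_risk
print(toy_positions.abs().sum(1).round(6).tolist())   # [3000.0, 3000.0]
```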

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 02:01:00 | 880.534455 | 271.232119 | -1848.233427 |
| 2022-01-01 02:02:00 | -1391.297166 | -735.261197 | -873.441637 |
| 2022-01-01 02:03:00 | -502.468153 | -460.548815 | -2036.983032 |
| 2022-01-01 02:04:00 | 100.628351 | -1287.314163 | -1612.057485 |
| 2022-01-01 02:05:00 | -1967.324465 | -538.847394 | -493.828142 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | 1215.682310 | -612.124143 | 1172.193547 |
| 2022-11-05 23:57:00 | -1184.822222 | -568.463161 | 1246.714618 |
| 2022-11-05 23:58:00 | -596.824256 | -1084.127807 | 1319.047937 |
| 2022-11-05 23:59:00 | -1709.513113 | 1052.878743 | -237.608144 |
| 2022-11-06 00:00:00 | -1247.231733 | -362.268146 | 1390.500121 |

444840 rows × 3 columns

```
trades = positions.diff().fillna(positions.iloc[0])
trades
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 02:01:00 | 880.534455 | 271.232119 | -1848.233427 |
| 2022-01-01 02:02:00 | -2271.831620 | -1006.493316 | 974.791789 |
| 2022-01-01 02:03:00 | 888.829012 | 274.712382 | -1163.541395 |
| 2022-01-01 02:04:00 | 603.096505 | -826.765348 | 424.925547 |
| 2022-01-01 02:05:00 | -2067.952816 | 748.466770 | 1118.229344 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | 2382.024364 | -80.684424 | -130.024681 |
| 2022-11-05 23:57:00 | -2400.504532 | 43.660982 | 74.521071 |
| 2022-11-05 23:58:00 | 587.997965 | -515.664647 | 72.333319 |
| 2022-11-05 23:59:00 | -1112.688856 | 2137.006550 | -1556.656081 |
| 2022-11-06 00:00:00 | 462.281379 | -1415.146889 | 1628.108266 |

444840 rows × 3 columns

The trades are important when handing over this sort of signal to a live execution engine, but for the purposes of research here we'll just use the positions.

So now that we have our positions we can actually simulate how those positions would have performed under some very serious (and unfortunately unrealistic) assumptions.

To value our positions we want to multiply them by the returns in the following minute. I have a preference for moving the returns back in time by one step, so that I can see what the gross PnL of a particular signal was; you might prefer to move your signal forward in time so that it aligns with when you would have realised the gross PnL. You say tomato, I say tomato.

```
(positions*returns_df.shift(-1)).cumsum().plot(grid=True)
```
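A tiny toy example (illustrative numbers only) of why the `shift(-1)` lines each position up with the return it earns:

```python
import pandas as pd

toy_positions = pd.Series([10.0, -10.0, 10.0])
toy_returns = pd.Series([0.01, -0.02, 0.01])
# The position held at time t earns the return realised over the
# next interval, hence shifting returns back by one step
toy_pnl = toy_positions * toy_returns.shift(-1)
print(toy_pnl.tolist())   # [-0.2, -0.1, nan]
```

The final PnL is NaN because the last position has no subsequent return to earn yet.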

*[Cumulative gross PnL per symbol]*

As we can see, the lines go from bottom left to top right, which is what every quant is looking for. Let's take a look at the portfolio as a whole and compare it with buying and holding 1000 USD each of BTC, ETH and BNB.

NB. For the quick-minded reader, you'll see I cumulatively summed percentage returns. Very naughty, you might say; I should have used log returns or cumulatively multiplied. I say bite me, I'm a maverick, or more accurately, that I am just betting 3000 USD at each timestep with no compounding, whichever answer you prefer.

```
strategy_pnl = (positions*returns_df.shift(-1)).sum(1)
hold_pnl = (1000 * returns_df).sum(1)
create_pnl_srs(strategy_pnl, 'Strat')
create_pnl_srs(hold_pnl, 'BnH')
```
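`create_pnl_srs` is another redacted helper. A hypothetical minimal version, which just plots and returns the cumulative gross PnL (the real implementation may do more), could be:

```python
import pandas as pd

def create_pnl_srs(pnl: pd.Series, label: str) -> pd.Series:
    # Hypothetical stand-in for the redacted helper: plot and return
    # the cumulative gross PnL of a per-interval PnL series
    cum_pnl = pnl.cumsum()
    cum_pnl.plot(grid=True, label=label, legend=True)
    return cum_pnl
```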

Ok, so we outperformed the market by quite a bit. In fact, on \$3000 we made

```
strat_gross_pnl = strategy_pnl.sum()
strat_returns = 100 * strategy_pnl.sum() / dollars_at_risk
strat_sr = np.sqrt(260) * strategy_pnl.resample('B').sum().mean() / strategy_pnl.resample('B').sum().std()
print(f"Gross Pnl: ${strat_gross_pnl:.2f}\nReturns: {strat_returns:.2f}%\nSharpe Ratio: {strat_sr:.2f}")
```

Gross Pnl: $12319.90
Returns: 410.66%
Sharpe Ratio: 9.73

When compared with buy and hold which did

```
bnh_gross_pnl = hold_pnl.sum()
bnh_returns = 100 * hold_pnl.sum() / dollars_at_risk
bnh_sr = np.sqrt(260) * hold_pnl.resample('B').sum().mean() / hold_pnl.resample('B').sum().std()
print(f"Gross Pnl: ${bnh_gross_pnl:.2f}\nReturns: {bnh_returns:.2f}%\nSharpe Ratio: {bnh_sr:.2f}")
```

Gross Pnl: $-1239.96
Returns: -41.33%
Sharpe Ratio: -0.69

Not bad at all! I hear you clamour. Why didn't we start trading yesterday? I hear you shout. Well, there's a catch, there's always a catch.

So we've demonstrated a process which you might follow as an easy introduction to statistical arbitrage, but unfortunately it's flawed for a multitude of reasons:

- Fees, slippage and spread. We didn't include any notion of costs in this strategy, and they are always of order-1 importance. You'll find, if you analyse the bps per trade, that the edge this strategy has is too small to beat them.
- Instant fill. We assumed that we got our entire position on instantaneously to capture the minute's whole return. This just isn't realistic, and we need to include some notion of probabilistic fill time.
- Out of sample testing. Just because we had excellent results this year, we don't know if this strategy will hold up over time. We could easily have overfit to the data we have, so we should have followed a proper train, validate, test process.
- Too simplistic. At the heart of it, this strategy is too simple. Thousands of others have tried it and will try it, and that means any alpha it had will long since have decayed away.
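On the fees point, a first pass is simply to haircut PnL by a few bps on the dollars traded each interval. The fee rate here is a pure assumption for illustration, and the trades frame is a toy, not the real one above:

```python
import pandas as pd

fee_rate = 0.0002   # assumed 2bps taker fee, purely illustrative
# Toy trades frame: dollar position changes per interval
toy_trades = pd.DataFrame({'A': [100.0, -50.0], 'B': [-30.0, 20.0]})
# Fees are charged on the absolute dollars traded each interval
fees = toy_trades.abs().sum(1) * fee_rate
print(fees.round(6).tolist())   # [0.026, 0.014]
```

Subtracting a series like this from the gross PnL gives a rough net PnL, and is often enough to show whether an edge this small survives costs at all.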

But never fear dear reader, we can build upon this start. We can start to include more sophisticated signals, with multiple time horizons and mean-variance optimisation thrown in to boot. We can layer in other pieces of data and increase the universe of symbols and locations we trade. There is plenty more work to do here.