Paraphrasing investopedia.com: statistical arbitrage (or stat arb) is a group of trading strategies that use mean reversion analyses to invest in portfolios of many securities for very short periods of time. Well, what does this mean?

The key idea is mean reversion: the tendency of correlated assets to eventually return to their long-run average relationship. For example, say we believe that assets A and B are 100% correlated. If asset A were to increase in price by 5% in one day, we would expect asset B to do the same. If this didn't happen, say asset B only appreciated by 2%, then we might expect a price retracement in asset A. Given this assumption, we can make automated trading decisions when we observe such a dislocation.

Let's dive into a bit of code to bring some life to this idea. Note: the code in this notebook won't run as I have redacted some inputs to protect IP.

First we need to bring into scope some libraries that are super handy for analysing data.

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

```
symbols = ['BTC-BUSD-PERP', 'ETH-BUSD-PERP', 'BNB-BUSD-PERP']
```

```
perp_df = data_reader.read(symbols=symbols,
                           start_datetime=pd.Timestamp("2022-1-1"),
                           end_datetime=pd.Timestamp("2022-11-7"),
                           freq='60s',
                           features='mid',
                           exchanges='BINANCE_LINEAR')
```
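Since `data_reader` is internal and redacted, here is a purely hypothetical stand-in that generates correlated random-walk mid prices with the same shape, in case you want to follow along. The function name, the factor structure and all the numbers are my own invention, not the real data source.

```python
import numpy as np
import pandas as pd

def make_synthetic_mids(symbols, start, end, freq='60s', seed=0):
    # Hypothetical stand-in for the redacted data_reader: correlated
    # random-walk mid prices, one column per symbol
    index = pd.date_range(start, end, freq=freq)
    rng = np.random.default_rng(seed)
    market = rng.normal(0, 1e-3, size=len(index))    # shared market factor
    frame = {}
    for i, symbol in enumerate(symbols):
        idio = rng.normal(0, 5e-4, size=len(index))  # per-asset noise
        frame[symbol] = 100.0 * (i + 1) * np.exp(np.cumsum(market + idio))
    return pd.DataFrame(frame, index=index)

perp_df = make_synthetic_mids(
    ['BTC-BUSD-PERP', 'ETH-BUSD-PERP', 'BNB-BUSD-PERP'],
    pd.Timestamp("2022-1-1"), pd.Timestamp("2022-11-7"))
```

The shared `market` factor is what makes the synthetic series correlated, mimicking the real data's behaviour below.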

We can take a look at the pandas dataframe which holds the price data

```
perp_df
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 00:01:00 | 46244.25 | 3683.645 | 512.405 |
| 2022-01-01 00:02:00 | 46313.75 | 3691.225 | 513.145 |
| 2022-01-01 00:03:00 | 46358.05 | 3692.275 | 513.545 |
| 2022-01-01 00:04:00 | 46328.95 | 3689.615 | 513.115 |
| 2022-01-01 00:05:00 | 46315.35 | 3689.385 | 512.885 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | 21270.95 | 1626.095 | 348.895 |
| 2022-11-05 23:57:00 | 21278.05 | 1626.895 | 349.135 |
| 2022-11-05 23:58:00 | 21280.75 | 1627.455 | 349.085 |
| 2022-11-05 23:59:00 | 21293.55 | 1626.655 | 349.145 |
| 2022-11-06 00:00:00 | 21293.55 | 1626.405 | 348.965 |

444960 rows × 3 columns

and we can plot the daily prices to make sure they're reasonable.

```
fig, axs = plt.subplots(1, len(symbols), figsize=(15, 6))
for i, symbol in enumerate(symbols):
    perp_df[symbol].resample('D').last().plot(ax=axs[i], grid=True, title=symbol)
```

Looking at these price graphs alone already catches our attention. Just eyeballing them, you can see that the price action is very similar across these three assets, telling us they're highly correlated and "should" move together.

Since they're on different scales (and for some other nice mathematical properties) it's much more natural to use the returns of these price series as our starting point for investigation.

```
returns_df = perp_df.pct_change().dropna()
```

We can better see this idea of correlation if we plot the scatter plot of say ETH vs BNB.

```
returns_df.plot.scatter(x='ETH-BUSD-PERP', y='BNB-BUSD-PERP', grid=True)
```

*[Scatter plot of ETH-BUSD-PERP returns vs BNB-BUSD-PERP returns]*

Visually we can see there is an obvious correlation between the two assets. When ETH returns are positive so are BNB returns and vice versa. We can even see there is a preferred direction along which they move. This preferred direction is called the first principal component.

Without getting too heavy into the mathematics, we can work out this principal component with a little bit of linear algebra.

Recall that the covariance matrix $\Sigma$ of a centered (zero-mean) returns matrix $\mathbf{X}$ is defined as

$$\Sigma = \mathbb{E}(\mathbf{X}^T \mathbf{X})$$

where $\mathbb{E}$ is the expectation operator.

Once we have this covariance matrix $\Sigma$ we can extract the components by taking its eigendecomposition

$$\Sigma = V \Lambda V^T$$

where $V$ is the matrix of eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues. The eigenvector associated with the largest eigenvalue is our first principal component. We can demonstrate this visually
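As a quick numerical sanity check (a toy example with made-up numbers, not part of the strategy), we can confirm that `np.linalg.eigh` gives back exactly this decomposition:

```python
import numpy as np

# Toy 2x2 covariance matrix, purely illustrative
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
eigvals, eigvecs = np.linalg.eigh(sigma)   # eigenvalues in ascending order
# Reconstruct sigma as V Lambda V^T
print(np.allclose(eigvecs @ np.diag(eigvals) @ eigvecs.T, sigma))   # True
# The last column is the eigenvector of the largest eigenvalue,
# i.e. the first principal component
first_pc = eigvecs[:, -1]
```

Note that `eigh` sorts eigenvalues in ascending order, so the first principal component sits in the *last* column of `eigvecs`.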

```
cov = returns_df[['ETH-BUSD-PERP', 'BNB-BUSD-PERP']].cov()
# eigh returns eigenvalues in ascending order, so the last column of
# eigvecs is the eigenvector of the largest eigenvalue, i.e. our
# first principal component
eigvals, eigvecs = np.linalg.eigh(cov)
returns_df.plot.scatter(x='ETH-BUSD-PERP', y='BNB-BUSD-PERP', grid=True)
plt.quiver(0, 0, eigvecs[0, -1], eigvecs[1, -1], scale=3)
```

*[Scatter plot of ETH vs BNB returns with the first principal component drawn as an arrow]*

So as we can see our principal component points nicely along the direction we would expect. In mathematical speak, we could say that this direction explains the largest amount of variance we see in the returns of ETH and BNB. In fact we could go as far as to say this direction is the line we *expect* returns to fall on, and if they don't there's been a deviation. So how can we use this information to make a trading strategy?

So we know how to extract our principal component, and we're also armed with the assumption that perhaps when asset returns deviate from this component they could revert, but how can we proceed? Well it's back to our trusted friend linear algebra.

We can project our returns and measure their deviation from this component. For a centered returns matrix $\mathbf{X}$ and a matrix of eigenvectors $V$ (here, every eigenvector *except* the first principal component) this is equivalent to computing

$$P = \mathbf{X} V V^T$$

which gives us something that looks like

```
eth_bnb = returns_df[['ETH-BUSD-PERP', 'BNB-BUSD-PERP']]
eth_bnb_avg = eth_bnb.mean(axis=0)
# Equivalent to the projection equation above, including the centering step
proj = (eth_bnb - eth_bnb_avg) @ eigvecs[:,:-1] @ eigvecs[:,:-1].T + eth_bnb_avg.values
```

```
proj.plot.scatter(x='ETH-BUSD-PERP', y='BNB-BUSD-PERP', grid=True)
```

*[Scatter plot of the projected returns, which fall along a line]*

How can we interpret what this line means? Well, for each pair of ETH and BNB returns in our data, this is a visualisation of "how far" the returns have deviated from the expectation (i.e. that principal component line). We can then introduce a simple rule that says, "I expect any deviation will revert exactly back to the expected value at the next interval". This simple rule will be our signal.

Ok, I'm going to switch back to using all three symbols of interest, rather than just ETH and BNB. What's great about mathematics is that it allows us to take these intuitive geometric notions and generalise them to as many dimensions as we want.

One thing we have to be very careful of in trading is a concept known as "forward bias" (also called look-ahead bias). This is equivalent to seeing the future, which unfortunately in a live trading system we cannot do.

So when we create our signal, we need to be careful not to use the future. We can do this by "rolling" our returns over a window of interest. We calculate the signal on this window and then use that signal at the *next* timestep.

Rolling our returns with a window of 2hrs (which produces a 3D array) looks like

```
window = 120 # 2hrs
rolled_returns = roll(returns_df, window)
```
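The `roll` helper above is one of the redacted pieces. A minimal sketch of what it might do, using NumPy's `sliding_window_view` (the function body here is my assumption, not the real implementation), is:

```python
import numpy as np
import pandas as pd

def roll(returns_df: pd.DataFrame, window: int) -> np.ndarray:
    # Hypothetical sketch of the redacted helper: stack overlapping
    # windows of rows into shape (num_windows, window, num_assets)
    values = returns_df.to_numpy()
    views = np.lib.stride_tricks.sliding_window_view(values, window, axis=0)
    # sliding_window_view appends the window axis last; move it to the middle
    return views.transpose(0, 2, 1)
```

Each successive window drops the oldest row and appends the newest one, which is exactly what we see in the two displays below.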

We can display the first window of 2hrs, denoted as $\text{step} = 0$

```
display_roll(step=0)
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 00:02:00 | 0.001503 | 0.002058 | 0.001444 |
| 2022-01-01 00:03:00 | 0.000957 | 0.000284 | 0.000780 |
| 2022-01-01 00:04:00 | -0.000628 | -0.000720 | -0.000837 |
| 2022-01-01 00:05:00 | -0.000294 | -0.000062 | -0.000448 |
| 2022-01-01 00:06:00 | 0.002500 | 0.002331 | 0.001638 |
| ... | ... | ... | ... |
| 2022-01-01 01:57:00 | 0.000312 | 0.000150 | 0.000193 |
| 2022-01-01 01:58:00 | -0.000590 | -0.000414 | -0.001410 |
| 2022-01-01 01:59:00 | 0.000470 | 0.000497 | 0.000793 |
| 2022-01-01 02:00:00 | -0.000186 | -0.000099 | -0.000290 |
| 2022-01-01 02:01:00 | 0.000569 | 0.000827 | 0.001450 |

120 rows × 3 columns

Showing $\text{step} = 1$, we can see we have moved forward in time by one minute

```
display_roll(step=1)
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 00:03:00 | 0.000957 | 0.000284 | 0.000780 |
| 2022-01-01 00:04:00 | -0.000628 | -0.000720 | -0.000837 |
| 2022-01-01 00:05:00 | -0.000294 | -0.000062 | -0.000448 |
| 2022-01-01 00:06:00 | 0.002500 | 0.002331 | 0.001638 |
| 2022-01-01 00:07:00 | 0.001697 | 0.001736 | 0.000487 |
| ... | ... | ... | ... |
| 2022-01-01 01:58:00 | -0.000590 | -0.000414 | -0.001410 |
| 2022-01-01 01:59:00 | 0.000470 | 0.000497 | 0.000793 |
| 2022-01-01 02:00:00 | -0.000186 | -0.000099 | -0.000290 |
| 2022-01-01 02:01:00 | 0.000569 | 0.000827 | 0.001450 |
| 2022-01-01 02:02:00 | -0.000167 | -0.000236 | -0.000154 |

120 rows × 3 columns

```
covs, mus = calc_multidimensional_covariance(rolled_returns)
eigvals, eigvecs = np.linalg.eigh(covs)
V = eigvecs[:, :, :-1]
raw_signal = (rolled_returns - mus) @ V @ V.transpose(0, 2, 1) + mus
```
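`calc_multidimensional_covariance` is also redacted. A plausible sketch (my assumption of what it does, based on how its outputs are used above) computes a per-window mean and sample covariance over the stacked windows:

```python
import numpy as np

def calc_multidimensional_covariance(rolled_returns: np.ndarray):
    # Hypothetical sketch: per-window mean and sample covariance for
    # an array of shape (num_windows, window, num_assets)
    window = rolled_returns.shape[1]
    mus = rolled_returns.mean(axis=1, keepdims=True)              # (N, 1, d)
    centered = rolled_returns - mus
    covs = centered.transpose(0, 2, 1) @ centered / (window - 1)  # (N, d, d)
    return covs, mus
```

Because `covs` is a stack of `(d, d)` matrices, the batched `np.linalg.eigh(covs)` call above decomposes every window's covariance in one shot.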

```
signal = np.clip(raw_signal / raw_signal.std(axis = 1, keepdims = True, ddof = 1), -6, 6)
```

```
signal_df = pd.DataFrame(signal[:, -1, :], columns = returns_df.columns, index = returns_df.index[window - 1:])
signal_df
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 02:01:00 | -1.345710 | -0.414521 | 2.824633 |
| 2022-01-01 02:02:00 | 0.545039 | 0.288037 | 0.342169 |
| 2022-01-01 02:03:00 | 0.177717 | 0.162890 | 0.720455 |
| 2022-01-01 02:04:00 | -0.041850 | 0.535374 | 0.670429 |
| 2022-01-01 02:05:00 | 0.891872 | 0.244282 | 0.223873 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | -0.436605 | 0.219841 | -0.420987 |
| 2022-11-05 23:57:00 | 0.736302 | 0.353268 | -0.774764 |
| 2022-11-05 23:58:00 | 0.613871 | 1.115094 | -1.356724 |
| 2022-11-05 23:59:00 | 3.152448 | -1.941574 | 0.438164 |
| 2022-11-06 00:00:00 | 0.569008 | 0.165273 | -0.634369 |

444840 rows × 3 columns

So a few things to note whilst looking at our signal dataframe.

Firstly, you'll notice the first timestamp in the frame corresponds to the final timestamp in our first roll window. This makes sense: it was only at this time (when the first 2hrs had passed) that we had enough data to calculate a signal. This is a good sanity check to ensure we haven't included any future information.

Secondly, you'll notice that our signal can be both positive and negative. This corresponds to taking either a long or a short position in each asset (and hence why we look at perpetual futures on centralised exchanges!).

Finally, you'll notice that the numbers are small, so we need to think about how we scale these up to be dollar amounts we wish to trade.

There are many ways to skin a stat arb cat, but I'm just going to demonstrate something simple here. Remember back when we built the intuition about deviations from the first principal component, and that these assets would *revert* back. This means our trades need to be in the opposite direction to our signal (which measures just the deviation). So we'll need to include a minus sign.

We also want to scale our raw signals into something we can make sense of. USD is generally what we're trying to gain, so let's scale the signals so that at every timestep we bet 3000 USD in total. That is to say, it doesn't matter if we're long or short, we only want our dollars at risk to be 3000 USD. Hence, we need to normalise by the sum of the absolute values of the signals.

```
dollars_at_risk = 3000
positions = -signal_df.multiply((dollars_at_risk / signal_df.abs().sum(1)),axis=0)
positions
```
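To convince ourselves the scaling does what we want, here's a toy check (the signal numbers are made up for illustration) that the gross dollar exposure of each row comes out at exactly 3000 USD:

```python
import pandas as pd

dollars_at_risk = 3000
# Toy signal frame, purely illustrative numbers
toy_signal = pd.DataFrame({'A': [1.0, -2.0], 'B': [0.5, 0.5], 'C': [-1.5, 1.0]})
toy_positions = -toy_signal.multiply(dollars_at_risk / toy_signal.abs().sum(1), axis=0)
# Gross dollars at risk per row is exactly dollars_at_risk
print(toy_positions.abs().sum(1).round(6).tolist())   # [3000.0, 3000.0]
```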

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 02:01:00 | 880.534455 | 271.232119 | -1848.233427 |
| 2022-01-01 02:02:00 | -1391.297166 | -735.261197 | -873.441637 |
| 2022-01-01 02:03:00 | -502.468153 | -460.548815 | -2036.983032 |
| 2022-01-01 02:04:00 | 100.628351 | -1287.314163 | -1612.057485 |
| 2022-01-01 02:05:00 | -1967.324465 | -538.847394 | -493.828142 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | 1215.682310 | -612.124143 | 1172.193547 |
| 2022-11-05 23:57:00 | -1184.822222 | -568.463161 | 1246.714618 |
| 2022-11-05 23:58:00 | -596.824256 | -1084.127807 | 1319.047937 |
| 2022-11-05 23:59:00 | -1709.513113 | 1052.878743 | -237.608144 |
| 2022-11-06 00:00:00 | -1247.231733 | -362.268146 | 1390.500121 |

444840 rows × 3 columns

```
trades = positions.diff().fillna(positions.iloc[0])
trades
```

| | BTC-BUSD-PERP | ETH-BUSD-PERP | BNB-BUSD-PERP |
|---|---|---|---|
| 2022-01-01 02:01:00 | 880.534455 | 271.232119 | -1848.233427 |
| 2022-01-01 02:02:00 | -2271.831620 | -1006.493316 | 974.791789 |
| 2022-01-01 02:03:00 | 888.829012 | 274.712382 | -1163.541395 |
| 2022-01-01 02:04:00 | 603.096505 | -826.765348 | 424.925547 |
| 2022-01-01 02:05:00 | -2067.952816 | 748.466770 | 1118.229344 |
| ... | ... | ... | ... |
| 2022-11-05 23:56:00 | 2382.024364 | -80.684424 | -130.024681 |
| 2022-11-05 23:57:00 | -2400.504532 | 43.660982 | 74.521071 |
| 2022-11-05 23:58:00 | 587.997965 | -515.664647 | 72.333319 |
| 2022-11-05 23:59:00 | -1112.688856 | 2137.006550 | -1556.656081 |
| 2022-11-06 00:00:00 | 462.281379 | -1415.146889 | 1628.108266 |

444840 rows × 3 columns

The trades are important when handing over this sort of signal to a live execution engine, but for the purposes of research here we'll just use the positions.

So now that we have our positions we can actually simulate how those positions would have performed under some very serious (and unfortunately unrealistic) assumptions.

To value our positions we want to multiply them by the returns in the following minute. I have a preference for moving the returns back in time by one step, so that I can see what the gross PnL of a particular signal was; you might prefer to move your signal forward in time so that it aligns with when you would have realised the gross PnL. You say tomato, I say tomato.

```
(positions*returns_df.shift(-1)).cumsum().plot(grid=True)
```
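A tiny toy example (illustrative numbers only) of why the `shift(-1)` lines each position up with the return it earns:

```python
import pandas as pd

toy_positions = pd.Series([10.0, -10.0, 10.0])
toy_returns = pd.Series([0.01, -0.02, 0.01])
# The position held at time t earns the return realised over the
# next interval, hence shifting returns back by one step
toy_pnl = toy_positions * toy_returns.shift(-1)
print(toy_pnl.tolist())   # [-0.2, -0.1, nan]
```

The final PnL is NaN because the last position has no subsequent return to earn yet.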

*[Cumulative gross PnL per symbol]*

As we can see, the lines go from bottom left to top right, which is what every quant is looking for. Let's take a look at the portfolio as a whole and compare it with buying and holding 1000 USD each of BTC, ETH and BNB.

NB. For the quick-minded reader, you'll see I cumulatively summed percentage returns. Very naughty, you might say; I should have used log returns or cumulatively multiplied. I say bite me, I'm a maverick, or more accurately, that I am just betting 3000 USD at each timestep with no compounding, whichever answer you prefer.

```
strategy_pnl = (positions*returns_df.shift(-1)).sum(1)
hold_pnl = (1000 * returns_df).sum(1)
create_pnl_srs(strategy_pnl, 'Strat')
create_pnl_srs(hold_pnl, 'BnH')
```
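`create_pnl_srs` is another redacted helper. A hypothetical minimal version, which just plots and returns the cumulative gross PnL (the real implementation may do more), could be:

```python
import pandas as pd

def create_pnl_srs(pnl: pd.Series, label: str) -> pd.Series:
    # Hypothetical stand-in for the redacted helper: plot and return
    # the cumulative gross PnL of a per-interval PnL series
    cum_pnl = pnl.cumsum()
    cum_pnl.plot(grid=True, label=label, legend=True)
    return cum_pnl
```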

Ok, so we outperformed the market by quite a bit. In fact, on \$3000 we made

```
strat_gross_pnl = strategy_pnl.sum()
strat_returns = 100 * strategy_pnl.sum() / dollars_at_risk
strat_sr = np.sqrt(260) * strategy_pnl.resample('B').sum().mean() / strategy_pnl.resample('B').sum().std()
print(f"Gross Pnl: ${strat_gross_pnl:.2f}\nReturns: {strat_returns:.2f}%\nSharpe Ratio: {strat_sr:.2f}")
```

Gross Pnl: $12319.90
Returns: 410.66%
Sharpe Ratio: 9.73

When compared with buy and hold which did

```
bnh_gross_pnl = hold_pnl.sum()
bnh_returns = 100 * hold_pnl.sum() / dollars_at_risk
bnh_sr = np.sqrt(260) * hold_pnl.resample('B').sum().mean() / hold_pnl.resample('B').sum().std()
print(f"Gross Pnl: ${bnh_gross_pnl:.2f}\nReturns: {bnh_returns:.2f}%\nSharpe Ratio: {bnh_sr:.2f}")
```

Gross Pnl: $-1239.96
Returns: -41.33%
Sharpe Ratio: -0.69

Not bad at all! I hear you clamour. Why didn't we start trading yesterday? I hear you shout. Well, there's a catch, there's always a catch.

So we've demonstrated a process which you might follow as an easy introduction to statistical arbitrage, but unfortunately it's flawed for a multitude of reasons:

- Fees, slippage and spread. We didn't include any notion of costs in this strategy, and they are always of order-1 importance. You'll find, if you analyse the bps per trade, that the edge this strategy has is too small to beat them.
- Instant fill. We assumed that we got our entire position on instantaneously to capture the minute's whole return. This just isn't realistic, and we need to include some notion of probabilistic fill time.
- Out of sample testing. Just because we had excellent results this year, we don't know if this strategy will hold up over time. We could easily have overfit to the data we have, so we should have followed a proper train, validate, test process.
- Too simplistic. At the heart of it, this strategy is too simple. Thousands of others have tried it and will try it, and that means any alpha it had will long since have decayed away.
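On the fees point, a first pass is simply to haircut PnL by a few bps on the dollars traded each interval. The fee rate here is a pure assumption for illustration, and the trades frame is a toy, not the real one above:

```python
import pandas as pd

fee_rate = 0.0002   # assumed 2bps taker fee, purely illustrative
# Toy trades frame: dollar position changes per interval
toy_trades = pd.DataFrame({'A': [100.0, -50.0], 'B': [-30.0, 20.0]})
# Fees are charged on the absolute dollars traded each interval
fees = toy_trades.abs().sum(1) * fee_rate
print(fees.round(6).tolist())   # [0.026, 0.014]
```

Subtracting a series like this from the gross PnL gives a rough net PnL, and is often enough to show whether an edge this small survives costs at all.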

But never fear dear reader, we can build upon this start. We can start to include more sophisticated signals, with multiple time horizons and mean-variance optimisation thrown in to boot. We can layer in other pieces of data and increase the universe of symbols and locations we trade. There is plenty more work to do here.