Building A Pairs-Trading Strategy With Python From Scratch

Identifying Profitable Stock Pairs Using the Distance Method

In our previous articles, we explored the construction of a pairs trading strategy for cryptocurrencies using the distance method. This time, we’ll turn our attention to one of the largest and most established financial markets — the U.S. stock market. Specifically, we will apply the same distance method to S&P 500 stocks. With historical data spanning over two decades, we’ll be able to perform a more reliable backtest, gaining deeper insights into the strategy’s performance.

Importing Libraries and Loading Data
To get started, we load the necessary libraries, read the historical S&P 500 constituent lists, and then download price data from Yahoo Finance via the yfinance package.

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
from itertools import combinations
from datetime import datetime, timedelta
import yfinance as yf

# Number of pairs in a portfolio during a period
PORTFOLIO_SIZE = 20

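# Load the constituent history: one row per date with a comma-separated ticker list
data = pd.read_excel("sp500.xlsx", sheet_name="updated")
# Normalize dates to 'YYYY-MM-DD' strings first so the comparison below is type-safe
data['date'] = pd.to_datetime(data['date']).dt.strftime('%Y-%m-%d')
ind = data[data['date'] == '2000-01-03'].index[0]
# Slice by label (ind is an index label) and copy to avoid chained-assignment warnings
data = data.loc[ind:].copy()

# Create tickers dictionary: date string -> list of constituent tickers
tickers = dict(zip(data['date'], data['tickers'].str.split(",")))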

# Collect all tickers
all_tickers = set()
for t in tickers.values():
    all_tickers.update(t)
all_tickers = list(all_tickers)
tickers_to_remove = ['TIE', 'BSC', 'BDK', 'CBE', 'ACS', 'MEE', 'BOL']
filtered_tickers = [ticker for ticker in all_tickers if ticker not in tickers_to_remove]

Some tickers in the dataset may have data quality issues, so we filter them out (the tickers_to_remove list) to give the strategy cleaner inputs. With the ticker list cleaned, we download the historical price data.

# auto_adjust=False keeps the 'Adj Close' column; recent yfinance versions adjust prices by default and omit it
historical_data = yf.download(filtered_tickers, start="1999-01-01", end="2024-04-01", group_by='ticker', auto_adjust=False)
adj_close_data = historical_data.xs('Adj Close', level=1, axis=1)
# Drop columns where all values are NaN
close_prices = adj_close_data.dropna(axis=1, how='all')

Building the Strategy

The strategy uses daily stock price data from 1999 through March 2024. For each period, we compute the SSD (Sum of Squared Differences) between normalized prices over a one-year lookback window and select the 20 most similar pairs. These pairs are then traded over a six-month horizon. Positions are opened on Z-score thresholds: a pair is bought or sold when the Z-score of its spread crosses ±2, and the position is closed once the Z-score reverts to 0.
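
To make the entry and exit rule concrete, here is a minimal sketch of Z-score signal generation for a single pair. It assumes the spread is the difference between two normalized price series and that the mean and standard deviation come from the lookback window. The function name zscore_signals, its parameters, and the sign convention (buy the spread when it is unusually low) are illustrative assumptions, not taken from the original implementation.

```python
import pandas as pd

def zscore_signals(spread, mean, std, entry=2.0, exit_level=0.0):
    """Return a position series: +1 long the spread, -1 short, 0 flat.

    Hypothetical helper: the thresholds follow the text (enter at +/-2, exit at 0),
    while the sign convention is an assumption.
    """
    z = (spread - mean) / std
    position = 0
    positions = []
    for value in z:
        if position == 0:
            if value <= -entry:
                position = 1       # spread unusually low: buy it
            elif value >= entry:
                position = -1      # spread unusually high: sell it
        elif position == 1 and value >= exit_level:
            position = 0           # long spread reverted to the mean
        elif position == -1 and value <= exit_level:
            position = 0           # short spread reverted to the mean
        positions.append(position)
    return pd.Series(positions, index=spread.index)

# Toy spread that is already in Z-score units (mean 0, std 1)
spread = pd.Series([0.0, 0.5, 2.5, 1.0, -0.1, -2.2, -1.0, 0.3])
positions = zscore_signals(spread, mean=0.0, std=1.0)
print(positions.tolist())
```

The loop keeps a position open until the Z-score touches the exit level, so a move back inside the ±2 band alone does not close the trade.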

The implementation remains similar to the cryptocurrency version we discussed previously, but let’s review each component for clarity.

First, we normalize the price data and calculate SSD using the following functions:

def normalize(df, min_vals, max_vals):
    return (df - min_vals) / (max_vals - min_vals)

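def calculate_ssd(df):
    # Keep only stocks with a complete price history in the window
    filtered_df = df.dropna(axis=1)
    # SSD for every unordered pair of remaining columns, keyed as 'TICKER1-TICKER2'
    return {f'{c1}-{c2}': np.sum((filtered_df[c1] - filtered_df[c2]) ** 2) for c1, c2 in combinations(filtered_df.columns, 2)}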

def top_x_pairs(df, start, end):
    # start and end are kept in the signature; they are not used in this function
    ssd_results_dict = calculate_ssd(df)
    # Rank candidate pairs from most to least similar (smallest SSD first)
    sorted_ssd_dict = dict(sorted(ssd_results_dict.items(), key=lambda item: item[1]))
    most_similar_pairs = {}
    used_stocks = set()
    for pair, ssd in sorted_ssd_dict.items():
        stock1, stock2 = pair.split('-')
        # Greedy matching: each stock may appear in at most one selected pair
        if stock1 not in used_stocks and stock2 not in used_stocks:
            most_similar_pairs[stock1] = (pair, ssd)
            used_stocks.add(stock1)
            used_stocks.add(stock2)
            if len(most_similar_pairs) == PORTFOLIO_SIZE:
                break
    # Pairs were inserted in ascending SSD order, so no further sorting is needed
    return list(most_similar_pairs.values())
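
As a small worked example of the distance method, the sketch below min-max normalizes three toy price series and computes the SSD for every pair. The tickers AAA, BBB, and CCC and their prices are made up for illustration. It keys the results by tuple rather than by a hyphen-joined string, which is a deliberate departure from the functions above: a tuple key avoids having to split tickers that may themselves contain a hyphen.

```python
from itertools import combinations

import pandas as pd

# Made-up prices: BBB has the same shape as AAA at twice the price level
prices = pd.DataFrame({
    "AAA": [10.0, 11.0, 12.0, 13.0],
    "BBB": [20.0, 22.0, 24.0, 26.0],
    "CCC": [5.0, 9.0, 6.0, 7.0],
})

# Min-max scale each column to [0, 1] over the window
normalized = (prices - prices.min()) / (prices.max() - prices.min())

# Sum of squared differences for every unordered pair of columns
ssd = {
    (a, b): float(((normalized[a] - normalized[b]) ** 2).sum())
    for a, b in combinations(normalized.columns, 2)
}

best_pair = min(ssd, key=ssd.get)
print(best_pair, ssd[best_pair])
```

After normalization AAA and BBB trace the same path, so their SSD is zero and they rank as the most similar pair even though their raw prices differ by a factor of two.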

We set PORTFOLIO_SIZE to 20, selecting the 20 pairs with the smallest SSD during each period. Because a stock is skipped once it has been matched, each stock appears in at most one pair. A few additional utility functions support date-based operations:

def get_previous_date(dates_list, target_date_str):
    dates = [datetime.strptime(date, '%Y-%m-%d') for date in dates_list]
    target_date = datetime.strptime(target_date_str, '%Y-%m-%d')
    dates.sort()
    previous_date = None
    for date in dates:
        if date >= target_date:
            break
        previous_date = date
    return previous_date.strftime('%Y-%m-%d') if previous_date else None

def one_day_after(date_str):
    date_format = "%Y-%m-%d"
    date_obj = datetime.strptime(date_str, date_format)
    return (date_obj + timedelta(days=1)).strftime(date_format)

def one_year_before(date_str):
    date_format = "%Y-%m-%d"
    original_date = datetime.strptime(date_str, date_format)
    try:
        return original_date.replace(year=original_date.year - 1).strftime(date_format)
    except ValueError:
        return original_date.replace(month=2, day=28, year=original_date.year - 1).strftime(date_format)
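
For comparison, the same date arithmetic can be sketched with pandas alone. This is an alternative to the helpers above, not the article's implementation, and the sample dates are made up. pd.DateOffset(years=1) clips February 29 to February 28, which is the case the try/except above handles by hand, and searchsorted finds the latest listed date strictly before a target, assuming the index is sorted.

```python
import pandas as pd

date = pd.Timestamp("2024-02-29")

# One calendar year earlier; a leap day is clipped to the end of February
year_before = (date - pd.DateOffset(years=1)).strftime("%Y-%m-%d")

# The following calendar day
next_day = (date + pd.Timedelta(days=1)).strftime("%Y-%m-%d")

# Latest date strictly before the target (the index must be sorted)
dates = pd.DatetimeIndex(["2023-12-29", "2024-01-02", "2024-01-03"])
target = pd.Timestamp("2024-01-03")
pos = dates.searchsorted(target, side="left")
previous = dates[pos - 1].strftime("%Y-%m-%d") if pos > 0 else None

print(year_before, next_day, previous)
```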

Strategy Returns

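We calculate the strategy return over each holding period by averaging the profit and loss across the pairs in the portfolio. The listing below is abbreviated and shows only the overall structure: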

def strategy_return(data, commission=0.001):
    pnl = 0
    for df in data.values():
        # Handle long positions
        long_entries = df[df['buy'] == 1].index
        for idx in long_entries:
            exit_idx = df[(df.index > idx) & (df['long_exit'])].index
            # Position details omitted here for clarity.
    return pnl / len(data)
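
Since the position details are not shown, the following is only a sketch of how the net return of one round-trip pair trade might be computed. Everything in it is an assumption: the function name pair_trade_return, the equal split of capital between the two legs, and the way the commission is charged on each of the four fills. The original code may size positions and apply costs differently.

```python
def pair_trade_return(entry_a, exit_a, entry_b, exit_b, commission=0.001):
    """Net return on capital of going long A and short B (hypothetical helper).

    Assumptions: half of the capital is allocated to each leg, and the
    commission rate is charged on traded notional at entry and exit of both legs.
    """
    long_leg = exit_a / entry_a - 1        # return of the long position
    short_leg = -(exit_b / entry_b - 1)    # return of the short position
    gross = 0.5 * (long_leg + short_leg)
    # Four fills at roughly 0.5 notional each; exit notional is approximated by entry notional
    costs = 4 * 0.5 * commission
    return gross - costs

# Made-up prices: A rises 4%, B rises 2%
r = pair_trade_return(100.0, 104.0, 50.0, 51.0)
print(round(r, 6))
```

With these toy prices the long leg outperforms the short leg by two percentage points, so the gross return on capital is 1% before costs.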

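We also restrict each formation window to the stocks that were index constituents at the time, and exclude those with missing data or low liquidity. The listing is abbreviated; dates_list is presumably the list of constituent dates (the keys of tickers):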

def filter_stocks(date):
    nearest_date = get_previous_date(dates_list, date)
    stock_list = tickers[nearest_date]
    formation_start_date = one_year_before(date)
    stocks_data = historical_data.loc[formation_start_date:date]
    # Remove stocks with missing data or low liquidity.
    return filtered_stocks
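
The filtering criteria themselves are not shown, so the sketch below illustrates one plausible way to apply them. The function name filter_complete_and_liquid, the use of average daily dollar volume as the liquidity measure, and the 1,000,000 threshold are all hypothetical choices made for illustration; the toy data is made up as well.

```python
import numpy as np
import pandas as pd

def filter_complete_and_liquid(prices, volumes, min_avg_dollar_volume=1_000_000):
    """Keep tickers with no missing prices and enough average dollar volume.

    Hypothetical helper: the liquidity measure and the threshold are assumptions.
    """
    # Drop any ticker with at least one missing price in the window
    complete = prices.dropna(axis=1, how="any").columns
    # Average daily dollar volume over the window
    dollar_volume = (prices[complete] * volumes[complete]).mean()
    return dollar_volume[dollar_volume >= min_avg_dollar_volume].index.tolist()

prices = pd.DataFrame({
    "AAA": [10.0, 11.0, 12.0],
    "BBB": [20.0, np.nan, 22.0],   # missing price: excluded
    "CCC": [1.0, 1.0, 1.0],        # thinly traded: excluded
})
volumes = pd.DataFrame({
    "AAA": [200_000, 200_000, 200_000],
    "BBB": [1_000_000, 1_000_000, 1_000_000],
    "CCC": [1_000, 1_000, 1_000],
})

kept = filter_complete_and_liquid(prices, volumes)
print(kept)
```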

Results

The strategy achieved an annualized return of 3.09%, which is lower than what the same approach delivered in cryptocurrency markets. However, the maximum drawdown is a modest 5.66%, indicating limited downside risk. Performance was strong until 2014 and weakened thereafter, possibly because of heightened competition or changes in market structure.
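
For readers who want to reproduce these two statistics from their own backtest, here is a minimal sketch of how annualized return and maximum drawdown are commonly computed from an equity curve. It assumes a daily equity series and 252 trading days per year; the function names and the toy equity values are illustrative, and the article's own calculation may differ in detail.

```python
import numpy as np
import pandas as pd

def annualized_return(equity, periods_per_year=252):
    """Compound annual growth rate of an equity curve sampled at a fixed frequency."""
    total_growth = equity.iloc[-1] / equity.iloc[0]
    years = (len(equity) - 1) / periods_per_year
    return total_growth ** (1 / years) - 1

def max_drawdown(equity):
    """Largest peak-to-trough decline, expressed as a positive fraction."""
    running_peak = equity.cummax()
    return float((1 - equity / running_peak).max())

# Made-up equity curve: the deepest fall is from 1.20 to 1.08
equity = pd.Series([1.00, 1.10, 1.045, 1.20, 1.08, 1.21])
mdd = max_drawdown(equity)

# Two years of daily observations growing from 1.00 to 1.21 overall
two_years = pd.Series(np.linspace(1.0, 1.21, 505))
cagr = annualized_return(two_years)

print(round(mdd, 4), round(cagr, 4))
```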