A Transformer-Based Alternative for Identifying Competitors



Disclaimer: This article is for informational and educational purposes only. It does not constitute financial advice, investment recommendations, or an endorsement of any specific strategy, security, or financial product. The content is intended to explore data science and machine learning techniques in a research context. Readers are solely responsible for any decisions based on this information. They are strongly encouraged to conduct their own due diligence and consult with a licensed financial advisor before making any investment decisions.

Data sourced from Financial Modeling Prep (FMP).

Introduction

Competitive analysis is a crucial step one must take when investing in stocks.

With that being said, how do you identify a company’s competitors? Traditionally, one would probably filter for companies in the same sector or industry.

If you were to ask me, I would say that this is like painting with a broad brush.

Using the current constituents in the S&P 500, let’s look at a breakdown of the sectors and how many companies are in them:

Image provided by the author

If we assume each sector-based competitor requires a thorough analysis, the above list is quite large for each sector.

Not only that, are companies in the same sector necessarily competitors? Let’s look at an example.

Let’s say you want to invest in AAPL (Apple), and based on its SEC profile, it is in the Technology sector.


Traditionally, Apple has been known for its iPhones, iPads, and MacBooks, but it now offers a wide range of products, including watches, accessories, Apple TV, Apple Pay, and more.

Theoretically, a true competitor would have very similar lines of business; however, if you dive into the list of Technology companies, you will find lines of business that include:

Chip makers
Software providers
Cloud services

That short list could be much longer, and each group could contain numerous subgroups; for example, one company could provide video chat software, while another offers database services. Among the potential competitors of Apple, some may align closely, while others will have virtually no similarities.

So, how do we find the most relevant competitors?

Luckily, publicly traded companies are required to file numerous reports with the SEC periodically, and these reports include a formal description.

Let’s look at Apple’s latest company description as an example:


“Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, and HomePod. It also provides AppleCare support and cloud services, and operates various platforms, including the App Store, which allows customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. Additionally, it offers advertising services, including third-party licensing arrangements and its own advertising platforms. In addition, the company offers various subscription-based services, such as Apple Arcade, a game subscription service; Apple Fitness+, a personalized fitness service; Apple Music, which offers users a curated listening experience with on-demand radio stations; Apple News+, a subscription news and magazine service; Apple TV+, which offers exclusive original content; Apple Card, a co-branded credit card; and Apple Pay, a cashless payment service, as well as licenses its intellectual property. The company serves consumers, small and mid-sized businesses, and the education, enterprise, and government markets. It distributes third-party applications for its products through the App Store. The company also sells its products through its retail and online stores, a direct sales force, and third-party cellular network carriers, wholesalers, retailers, and resellers. Apple Inc. was founded in 1976 and is headquartered in Cupertino, California.”

There is a lot of valuable and detailed information that can be found here, such as:

The lines of business
The customers
How business is conducted

Although it may be a valuable exercise, reading the formal SEC descriptions of a company and all of its potential competitors may not be feasible, nor would it be scalable.

Luckily, we can leverage some of the same techniques found in generative AI tools, along with data from various endpoints of the Financial Modeling Prep (FMP) API, to compare company descriptions and find similar companies, even quantifying this similarity with a metric.

We will accomplish this using Sentence Transformers.

What are Sentence Transformers?

Sentence Transformers are models that one can leverage to convert a sentence or document into a numerical representation, more specifically, a vector.

Once a document is converted to a vector, numerous options become available for its use.

In the context of this project, we will utilize the vectors for calculating Semantic Text Similarity.

Semantic Text Similarity (STS) is a technique that detects similarity between two texts by going beyond similar words or counts to extract the actual meaning of what each text conveys.

In other words, it attempts to assign values based on the various possible messages that a document or sentence is conveying.

This context is captured by way of a self-attention mechanism.

This is highly useful for search and recommendation systems, chatbots, and topic modeling.

This makes it a perfect tool to leverage for the use case we will address.
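To see why going beyond word counts matters, here is a minimal sketch of a purely word-based similarity score (Jaccard overlap). The three sentences and the helper names are made up for illustration; they are not part of the project code. The point is that a word-overlap measure scores two sentences with the same meaning but different vocabulary as completely dissimilar, which is the gap semantic similarity is meant to close.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of distinct words that two texts have in common."""
    words_a, words_b = word_set(a), word_set(b)
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical sentences, for illustration only
s1 = "The firm sells smartphones"
s2 = "This company markets mobile handsets"   # same meaning, no shared words
s3 = "The firm sells insurance"               # different business, shared words

print("s1 vs s2:", jaccard(s1, s2))
print("s1 vs s3:", jaccard(s1, s3))
```

Word overlap ranks the insurance sentence as the closer match, even though the handset sentence describes the same business. A sentence embedding model is intended to get this ordering the other way around.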

Libraries & Data

Note that I built this project in Google Colab, so I added my FMP API key as a secret.

You can find the Secrets section here:

Image provided by the author

For the data, I built three functions that retrieve the current companies in the S&P 500 index, their SEC profile descriptions, and their SEC market sectors.

I will discuss the other libraries once we get to the steps where they are utilized.

import requests
import numpy as np
import pandas as pd
import re
from tqdm import tqdm
import spacy
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')

import torch
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

from google.colab import userdata
API_KEY = userdata.get('FMPAPIKEY')

def sp500_constituent_symbols(API_KEY):
    # Return the ticker symbols of the current S&P 500 constituents
    response = requests.get(f"https://financialmodelingprep.com/stable/sp500-constituent?apikey={API_KEY}")
    response.raise_for_status()  # surface HTTP errors instead of parsing an error payload
    constituents_df = pd.DataFrame(response.json())
    symbols = constituents_df['symbol'].tolist()
    return symbols

def get_SEC_profile(SYMBOL, API_KEY):
    # Return the company description from the SEC profile endpoint
    response = requests.get(f"https://financialmodelingprep.com/stable/sec-profile?symbol={SYMBOL}&apikey={API_KEY}")
    response.raise_for_status()
    profile_df = pd.DataFrame(response.json())
    return profile_df['description'].values[0]

def get_SEC_sector(SYMBOL, API_KEY):
    # Return the market sector from the same SEC profile endpoint
    response = requests.get(f"https://financialmodelingprep.com/stable/sec-profile?symbol={SYMBOL}&apikey={API_KEY}")
    response.raise_for_status()
    profile_df = pd.DataFrame(response.json())
    return profile_df['marketSector'].values[0]
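Both profile helpers above call the same sec-profile endpoint, so one possible refinement is to fetch each profile once and read both fields from it. The sketch below is only an illustration of that idea, not the code used for this article. The function names and the injectable get_json parameter are my own additions, and it assumes the endpoint returns a list of records containing the description and marketSector fields, as the pandas code above implies.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://financialmodelingprep.com/stable/sec-profile"

def fetch_json(url):
    """Download a URL and parse the body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)

def get_profile_fields(symbol, api_key, get_json=fetch_json):
    """Return (description, marketSector) using a single request.

    Assumes the response is a list of records; get_json can be swapped
    for a stub so the function can be tried without network access.
    """
    url = f"{BASE_URL}?{urlencode({'symbol': symbol, 'apikey': api_key})}"
    records = get_json(url)
    if not records:
        raise ValueError(f"No SEC profile returned for {symbol}")
    first = records[0]
    return first.get("description"), first.get("marketSector")

# Offline demonstration with a stubbed response (made-up values)
def fake_get_json(url):
    return [{"description": "Example description.", "marketSector": "Technology"}]

print(get_profile_fields("AAPL", "demo-key", get_json=fake_get_json))
```

With the default get_json, this would make one request per symbol instead of two, which roughly halves the time spent waiting on the API when looping over all of the constituents.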

Text Preprocessing

If you have worked on at least a handful of Natural Language Processing (NLP) projects, then the below might seem like a small amount of preprocessing.

This text preprocessing workflow removes common phrases found in most SEC descriptions and isolates key sentences that explain the business's activities.


Why not remove stop-words, convert words to n-grams, etc.?

This is because a sentence transformer model relies on the relationships between all of the words in a sentence, including stop-words and word order, which it captures through its self-attention mechanism.

Self-attention lets the model relate every word in a sentence to every other word, no matter how far apart they are.
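To make the idea of self-attention a little more concrete, here is a heavily simplified sketch in plain Python. It is not how RoBERTa is implemented: a real transformer uses learned query, key, and value projections and many attention heads, while this toy version uses the input vectors directly, and the 2-dimensional token vectors are invented for illustration. What it does show is that every token receives a weight for every other token in the sentence, regardless of distance.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(token_vectors):
    """Toy single-head scaled dot-product attention (queries = keys = values)."""
    dim = len(token_vectors[0])
    outputs, all_weights = [], []
    for query in token_vectors:
        # Score the current token against every token in the sentence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in token_vectors]
        weights = softmax(scores)
        # New representation: weighted mix of all token vectors
        mixed = [sum(w * vec[i] for w, vec in zip(weights, token_vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-d vectors standing in for real token embeddings
tokens = ["Apple", "sells", "phones", "worldwide"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

outputs, weights = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row holds one weight per token in the sentence and sums to one, so removing stop-words or reordering the text would change what every other token attends to.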

One last note: importing all the data below may take some time, depending on your computing power.

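def remove_boilerplate(text):
    # Strip common phrases (founding date, headquarters, former names)
    # that say little about what the business actually does
    patterns = [
        r'was (incorporated|founded).*?\.',
        r'headquartered in .*?\.',
        r'formerly known as .*?\.',
        r'doing business as .*?\.',
    ]
    for p in patterns:
        text = re.sub(p, '', text, flags=re.IGNORECASE)
    return text

def filter_operational_sentences(text):
    # Keep only sentences that contain at least one business-activity keyword
    sentences = sent_tokenize(text)
    keepers = [s for s in sentences if any(k in s.lower() for k in [
        'segment', 'provides', 'offers', 'services', 'markets', 'customers', 'solutions', 'operates', 'sells', 'commercial'])]
    return ' '.join(keepers)

def truncate_text(text, max_tokens=512):
    # Note: this counts whitespace-separated words, not model tokens;
    # the sentence transformer applies its own max_seq_length limit when encoding
    tokens = text.split()
    return ' '.join(tokens[:max_tokens])

def preprocess_description(text):
    # Full pipeline: remove boilerplate, keep operational sentences, cap the length
    text = remove_boilerplate(text)
    text = filter_operational_sentences(text)
    text = truncate_text(text)
    return text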

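symbols = sp500_constituent_symbols(API_KEY)
# Optional: uncomment the next line to test on a small subset first.
# The examples later on (AAPL, PLTR) need the full list to be processed.
# symbols = symbols[:25]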

company_descriptions = {}
for symbol in tqdm(symbols, desc="Processing companies"):
    try:
        raw_desc = get_SEC_profile(symbol, API_KEY)
        if raw_desc:
            processed = preprocess_description(raw_desc)
            company_descriptions[symbol] = processed
    except Exception as e:
        print(f"Error processing {symbol}: {e}")

sectors = {}
for symbol in tqdm(symbols, desc="Processing companies"):
    try:
        desc = get_SEC_sector(symbol, API_KEY)
        sectors[symbol] = desc
    except Exception as e:
        print(f"Error processing {symbol}: {e}")

df = pd.DataFrame({
    'Sector': sectors,
    'Description': company_descriptions
})

df.reset_index(inplace=True)

df.columns = ['symbol', 'Sector', 'Description']
df.head()
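To see what the boilerplate step actually does, here is a small standalone sketch that applies the same four regular expressions to a shortened sample. The sample text is stitched together from two sentences of Apple’s description shown earlier; the constant name is my own.

```python
import re

# Same patterns as in remove_boilerplate above
BOILERPLATE_PATTERNS = [
    r'was (incorporated|founded).*?\.',
    r'headquartered in .*?\.',
    r'formerly known as .*?\.',
    r'doing business as .*?\.',
]

def remove_boilerplate(text):
    """Delete each boilerplate phrase up to the next period."""
    for pattern in BOILERPLATE_PATTERNS:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE)
    return text

sample = (
    "It distributes third-party applications for its products through the App Store. "
    "Apple Inc. was founded in 1976 and is headquartered in Cupertino, California."
)

cleaned = remove_boilerplate(sample)
print(repr(cleaned))
```

The founding and headquarters details are removed, but the sentence subject ("Apple Inc.") is left behind as a fragment. In the full pipeline, the keyword filter that runs next should drop that fragment as long as it is tokenized as its own sentence, since it contains none of the keywords.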

Fitting the Model

Technically, we aren’t fitting a model; rather, we are using a pre-trained transformer model to encode the company descriptions.

Once they are encoded, we can utilize a similarity ranking technique to quantify how similar one description is to another.

The model we will use is called all-roberta-large-v1.

all-roberta-large-v1

This model is based on the RoBERTa large model from Facebook AI.

Readers interested in the original paper can check it out here: https://arxiv.org/abs/1907.11692. It is a robustly optimized version of the BERT model from Google.

At the time of its introduction in 2018, BERT was a state-of-the-art language model that now serves as a foundation for the large language models that have been developed over recent years.

It is famous for its bi-directional nature: it captures the contextual meaning of a word by looking at what comes both before and after it in a document or sentence.

RoBERTa differs from BERT in several key aspects, including but not limited to: training with a much larger corpus (collection of documents), not including next-sentence prediction, and dynamic masking with masked language modeling.

There is a lot more we could go into with this subject, but the main concept to understand is that the sentence transformer version of RoBERTa is ideal for semantic search tasks.

model = SentenceTransformer('all-roberta-large-v1')
df['embedding'] = df['Description'].apply(lambda x: model.encode(x, convert_to_tensor=True))

Let’s take a closer look at the newly added embedding column.

Technically, it is of type torch.Tensor, which is the PyTorch version of an array.

What makes this different from a NumPy array or a traditional Python list is that it is optimized for GPU operations and learning tasks.

Behind the scenes, each token passes through the model and gets encoded as a 1024-dimensional vector.

This is a key concept because the vector for a given token depends on its context, so the same token will be encoded differently in each sentence or document.

These embeddings are then averaged across the individual tokens for each document, or in our case, each SEC description.

This technique is known as mean-pooling.
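Here is a minimal sketch of mean-pooling in plain Python, using made-up 3-dimensional token vectors instead of the real 1024-dimensional ones. The attention mask, which marks padding positions so they are left out of the average, is my addition for illustration; the library handles this internally when you call model.encode.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping positions where the mask is 0."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, flag in zip(token_embeddings, attention_mask) if flag == 1]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical token embeddings for a three-token text plus one padding slot
token_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, ignored via the mask
]
attention_mask = [1, 1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [2.0, 2.0, 2.0]
```

However many tokens a description contains, the result is a single vector with the same number of dimensions as each token vector, which is what makes descriptions of different lengths directly comparable.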

So, what exactly are these embeddings?

Each embedding is made up of 1024 units.

Think of these individual units as abstract patterns that do not have individual meaning, but collectively, they represent some type of semantic relationship.

first_embedding = df['embedding'][0]
print(first_embedding)
print("Shape: ", first_embedding.shape)

Image provided by the author

Find Competitors with Cosine Similarity

At this point, we have a dataset of companies, their sectors, SEC descriptions, and a 1024-dimensional vector that numerically represents the description’s semantic meaning.

Using a technique known as Cosine Similarity, we can quantify how similar two companies are based on their description.

The cosine similarity scores will range from negative one to one, with one meaning the vectors point in the same direction, zero indicating that the two vectors are orthogonal, and negative one meaning the two vectors point in completely opposite directions.

Note that in the context of sentence embeddings, these values will almost always range from zero to one.

So, how should one interpret the scores in the context of text similarity?

I put together a simple guide:

close to one: The texts are almost identical or have a very similar meaning
close to zero: The texts have virtually no similarities
close to negative one: The texts contradict each other, but this is extremely rare in practice.
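For intuition, here is a small sketch of the calculation itself: the dot product of two vectors divided by the product of their lengths. It uses hand-picked 2-dimensional vectors rather than real embeddings, and the function below is a plain-Python stand-in for the scikit-learn function used in the project code.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked vectors, for illustration only
a = [1.0, 2.0]
same_direction = [2.0, 4.0]   # a scaled copy of a
orthogonal = [-2.0, 1.0]      # at a right angle to a
opposite = [-1.0, -2.0]       # a flipped copy of a

print("same direction:", round(cosine(a, same_direction), 6))
print("orthogonal:    ", round(cosine(a, orthogonal), 6))
print("opposite:      ", round(cosine(a, opposite), 6))
```

The three results should come out at roughly one, zero, and negative one. Note that the scaled copy still scores one: cosine similarity compares direction only, not magnitude.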

def get_top_n_similar(symbol, df, top_n=5):
    if symbol not in df['symbol'].values:
        raise ValueError("Symbol not found in dataframe.")

    target_embedding = df[df['symbol'] == symbol]['embedding'].values[0]
    target_embedding = target_embedding.cpu().numpy()

    all_embeddings = torch.stack(df['embedding'].tolist())
    all_embeddings = all_embeddings.cpu().numpy()

    similarities = cosine_similarity([target_embedding], all_embeddings)[0]

    df = df.copy()
    df['similarity'] = similarities
    result = df[df['symbol'] != symbol].sort_values(by='similarity', ascending=False).head(top_n)
    
    return result[['symbol', 'similarity', 'Sector']]

Let’s find some competitors! We will start with Apple.

Again, I should clarify that this ranking is based solely on how semantically similar the companies’ SEC descriptions are, and it only considers companies that are also in the S&P 500.

Amazon took the top spot with a cosine similarity score of 0.58.

At first glance, I was a bit shocked by this result, but as I compared both descriptions, I noticed some interesting patterns.

Let’s look at a few excerpts I have chosen from each document that I believe contributed to this score:

Apple: “Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide.”

Amazon: “The company also manufactures and sells electronic devices, including Kindle, Fire tablets, Fire TVs, Rings, Blink, eero, and Echo; and develops and produces media content.”

Apple: “It also provides AppleCare support and cloud services; and operates various platforms, including the App Store that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts, as well as advertising services include third-party licensing arrangements and its own advertising platforms.”

Amazon: “Further, the company provides compute, storage, database, analytics, machine learning, and other services, as well as fulfillment, advertising, and digital content subscriptions.”

It appears that the similarity engine captured some interesting relationships among the devices both companies sell and their online/cloud platforms.

One may also argue that Apple TV and Amazon Prime Video contributed to this high score.

Again, we can only speculate, because the patterns captured by the embeddings are abstract and not directly interpretable.

get_top_n_similar(symbol='AAPL', df=df)

Image provided by the author

Let’s take a look at another company that has been making waves in the financial world lately, Palantir.

The most similar company based on our similarity engine is Datadog.

Let’s take a closer look.

Palantir SEC Profile:

The company provides Palantir Gotham, a software platform which enables users to identify patterns hidden deep within datasets, ranging from signals intelligence sources to reports from confidential informants, as well as facilitates the handoff between analysts and operational users, helping operators plan and execute real-world responses to threats that have been identified within the platform. It also offers Palantir Foundry, a platform that transforms the ways organizations operate by creating a central operating system for their data; and allows individual users to integrate and analyze the data they need in one place. In addition, it provides Palantir Apollo, a software that delivers software and updates across the business, as well as enables customers to deploy their software virtually in any environment; and Palantir Artificial Intelligence Platform (AIP) that provides unified access to open-source, self-hosted, and commercial large language models (LLM) that can transform structured and unstructured data into LLM-understandable objects and can turn organizations’ actions and processes into tools for humans and LLM-driven agents.

Datadog SEC Profile:

Datadog, Inc. provides monitoring and analytics platform for developers, information technology operations teams, and business users in the cloud in North America and internationally. The company’s SaaS platform integrates and automates infrastructure monitoring, application performance monitoring, log management, and security monitoring to provide real-time observability of its customers technology stack. Its platform also provides user experience monitoring, network performance monitoring, cloud security, developer-focused observability, and incident management, as well as a range of shared features, such as dashboards, analytics, collaboration tools, and alerting capabilities.

Although the score is lower than what we saw between Apple and Amazon, I certainly do see how these two companies could be considered competitors.

Both companies' primary line of business is software platforms with a general focus on data and analytics.

get_top_n_similar(symbol='PLTR', df=df)

Image provided by the author
