How to Use Machine Learning for SEO Competitor Research


Source – https://www.searchenginejournal.com/

Learn how to use machine learning for more precise, statistically relevant, and scalable SEO competitor research (with tools, code & more).

With the ever-increasing appetite of SEO professionals to learn Python, there’s never been a better or more exciting time to take advantage of machine learning’s (ML) capabilities and apply these to SEO.

This is especially true in your competitor research.

In this column, you’ll learn how machine learning helps address common challenges in SEO competitor research, how to set up and train your ML model, how to automate your analysis, and more.

Let’s do this!

Why We Need Machine Learning in SEO Competitor Research

Most, if not all, SEO pros working in competitive markets will analyze the SERPs and their business competitors to find out what those sites are doing to achieve a higher rank.

Back in 2003, we used spreadsheets to collect data from SERPs, with columns representing different aspects of the competition such as the number of links to the home page, number of pages, etc.

In hindsight, the idea was right but the execution was hopeless due to the limitations of Excel in performing a statistically robust analysis in the short time required.

And if the limits of spreadsheets weren’t enough, the landscape has moved on quite a bit since then as we now have:

  • Mobile SERPs.
  • Social media.
  • A much more sophisticated Google Search experience.
  • Page Speed.
  • Personalized search.
  • Schema.
  • JavaScript frameworks and other new web technologies.

The above is by no means an exhaustive list of trends but serves to illustrate the ever-increasing range of factors that can explain the advantage of your higher-ranked competitors in Google.

Machine Learning in the SEO Context

Thankfully, with tools like Python/R, we’re no longer subject to the limits of spreadsheets. Python/R can handle millions to billions of rows of data.

If anything, the limit is the quality of data you can feed into your ML model and the intelligent questions you ask of your data.

As an SEO professional, you can make the decisive difference to your SEO campaign by cutting through the noise and using machine learning on competitor data to discover:

  • Which ranking factors can best explain the differences in rankings between sites.
  • What the winning benchmark is.
  • How much a unit change in the factor is worth in terms of rank.

Like any (data) science endeavor, there are a number of questions to be answered before we can start coding.

What Type of ML Problem is Competitor Analysis?

ML solves a number of problems whether it’s categorizing things (classification) or predicting a continuous number (regression).

In our particular case, the quality of a competitor’s SEO is denoted by its rank in Google, and because that rank is a numeric value, the ML problem is one of regression.

Outcome Metric

Given that we know the ML problem is one of regression, the outcome metric is rank. This makes sense for a number of reasons:

  • Rank won’t suffer from seasonality; an ice cream brand’s rankings for searches on [ice cream] won’t depreciate because it’s winter, unlike the “users” metric.
  • Competitor rank is third-party data and is available using commercial SEO tools, unlike their user traffic and conversions.

What Are the Features?

Knowing the outcome metric, we must now determine the independent variables, or model inputs, also known as features. The data types of these features will vary, for example:

  • First paint measured in seconds would be a numeric.
  • Sentiment with the categories positive, neutral, and negative would be a factor.
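As a minimal sketch (the column names and values here are made up for illustration), here is how those two data types might be represented in pandas, with the categorical "factor" column declared explicitly:

```python
import pandas as pd

# Hypothetical feature columns illustrating the two data types:
# first_paint is numeric (seconds); sentiment is categorical
# (a "factor" in R terms).
features = pd.DataFrame({
    "first_paint": [1.2, 0.8, 2.5],
    "sentiment": ["positive", "neutral", "negative"],
})

# Tell pandas that sentiment is a category rather than free text
features["sentiment"] = features["sentiment"].astype("category")

print(features.dtypes)
```

Declaring the categorical type up front makes later encoding steps (and memory use) more predictable than leaving the column as plain strings.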

Naturally, you want to cover as many meaningful features as possible including technical, content/UX, and offsite for the most comprehensive competitor research.

What Is the Math?

Given that rankings are numeric, and that we want to explain the difference in rank, then in mathematical terms:

rank ~ w_1*feature_1 + w_2*feature_2 + … + w_n*feature_n

where:

  • ~ (the “tilde”) means “explained by.”
  • n is the number of features, with feature_n being the nth feature.
  • w is the weighting of each feature.
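As a toy illustration of the formula above (the weights and feature values here are entirely made up), the weighted sum can be computed like so:

```python
# Hypothetical weights for two features, e.g. page speed in seconds
# and AMP availability (1/0). Negative weights mean a higher feature
# value pushes toward a better (lower) rank score.
weights = [-0.5, -2.0]
feature_values = [3.0, 1.0]

# rank ~ w_1*feature_1 + w_2*feature_2
predicted_rank_score = sum(w * f for w, f in zip(weights, feature_values))
print(predicted_rank_score)  # -3.5
```

In practice the model learns these weights (or, for tree-based models like XGBoost, more complex equivalents) from the data rather than you supplying them.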

Using Machine Learning to Uncover Competitor Secrets

With the answers to these questions in hand, we’re ready to see what secrets machine learning can reveal about your competition.

At this point, we will assume that your data (known in this example as “serps_data”) has been joined, transformed, cleaned, and is now ready for modeling.

As a minimum, this data will contain the Google rank and feature data you want to test.

For example, your columns could include:

  • Google_rank.
  • Page_speed.
  • Sentiment.
  • Flesch_kincaid_reading_ease.
  • Amp_version_available.
  • Site_depth.
  • Internal_page_rank.
  • Referring_domains_count.
  • avg_domain_authority_backlinks.
  • title_keyword_string_distance.
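One practical wrinkle: XGBoost’s scikit-learn API expects numeric inputs, so a categorical column such as Sentiment needs encoding before modeling. A minimal sketch, assuming a few of the column names above (the rows here are invented):

```python
import pandas as pd

# Hypothetical stand-in for the prepared serps_data table
serps_data = pd.DataFrame({
    "Google_rank": [1, 2, 3],
    "Page_speed": [1.1, 2.3, 3.0],
    "Sentiment": ["positive", "neutral", "negative"],
})

# One-hot encode the categorical Sentiment column so the model
# receives only numeric inputs
serps_data = pd.get_dummies(serps_data, columns=["Sentiment"])

print(serps_data.columns.tolist())
```

After encoding, Sentiment becomes three indicator columns (Sentiment_negative, Sentiment_neutral, Sentiment_positive) and the frame is ready for modeling.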

Training Your ML Model

To train the model, we’ll use XGBoost because it tends to deliver better results than other ML models on tabular data like this.

Alternatives you may wish to trial in parallel are LightGBM (especially for much larger datasets), RandomForest, and Adaboost.

Try using the following Python code for XGBoost for your SERPs dataset:

# import the libraries
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# load the prepared SERPs data
serps_data = pd.read_csv('serps_data.csv')

# set the model variables
# your SERPs data with everything but the Google_rank column
serp_features = serps_data.drop(columns=['Google_rank'])

# your SERPs data with just the Google_rank column
rank_actual = serps_data.Google_rank

# instantiate the model ('reg:squarederror' replaces the
# deprecated 'reg:linear' objective)
serps_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=1231)

# fit the model
serps_model.fit(serp_features, rank_actual)

# generate the model predictions
rank_pred = serps_model.predict(serp_features)

# evaluate the model accuracy
mse = mean_squared_error(rank_actual, rank_pred)

Note that the above is very basic. In a real client scenario, you’d want to trial a number of model algorithms on a training data sample (about 80% of the data), evaluate each on the remaining 20%, and select the best model.
