Hunt down keyword cannibalization

⚠️ THIS IS A WORK IN PROGRESS


What is keyword cannibalization?

If you put a lot of articles out there, at some point some of them will compete with one another for the same keywords in Google results pages. That's what SEO people call 'keyword cannibalization'.

Does it matter SEO-wise?

Sometimes it's perfectly normal. I hope, for your sake, that several of your web pages show up when someone types your brand name into Google.

Sometimes it's not. Let me give you an example:

💭 Imagine you run an e-commerce website with various page types: products, FAQs, blog posts, …

At some point, Google makes a switch: a couple of search queries that were sending traffic to product pages now display one of your blog posts instead.

Inside Google Analytics, the SEO session count stays the same. Your rank-tracking software won't report any position changes.

And yet, those blog posts convert much less well, and at the end of the month this results in a decrease in sales.

Sometimes it can't be fixed because the search intent has changed, but sometimes it's just because you neglected your product pages. Either way, it's good to know what's happening.

How to check for keyword cannibalization?

There are several ways to do it. Of course, SEO tool vendors want you to use their tools, and the method from ahrefs is definitely useful. Unfortunately, this kind of tool can be imprecise: it doesn't take into account what's really happening.

So let's do it using Google Search Console and R. Once set up, you'll be able to check big batches of keywords in minutes.

step 0: install R & rstudio

step 1: install the necessary packages

First, we'll load the searchConsoleR package by Mark Edmondson. This will allow us to send requests to the Google 'Search Console API' very easily.

install.packages("searchConsoleR")
library(searchConsoleR)

Then let's load the tidyverse. For those who don't know it, it's a very popular meta-package that will allow us to work with data frames in a graceful way.

install.packages("tidyverse")
library(tidyverse)

And finally, something to help deal with Google account authentication (also by Mark Edmondson). It will spare you the pain of having to set up an API key.

install.packages("googleAuthR")
library(googleAuthR)

step 2 – gather DATA

Let's initiate authentication. This should open a new browser window asking you to validate access to your GSC account. The script will be allowed to make requests for a limited period of time.

scr_auth()

This will create a sc.oauth file inside your working directory. It stores your temporary access tokens. If you wish to switch between Google accounts, just delete the file, re-run the command and log in with another account.
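That reset looks like this, a minimal sketch assuming the token file really is named sc.oauth, as described above:

# Remove the cached token, then authenticate again with another Google account
if (file.exists("sc.oauth")) file.remove("sc.oauth")
scr_auth()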

Let’s list all websites we are allowed to send requests about:

sc_websites <- list_websites()
View(sc_websites)

and pick one

hostname <- "https://www.example.com/"

don’t forget to update this with your hostname

As you may know, Search Console data is not available right away. That's why we request data for the last available two months: between three days ago and roughly two months before that… again using a handy little package!

install.packages("lubridate")
require(lubridate)
tree_days_ago <- lubridate::today()-3
beforedate <- tree_days_ago
month(beforedate) <- month(beforedate) - 2
day(beforedate) <- days_in_month(beforedate)
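As a side note, a roughly equivalent sketch of mine (not part of the original script) uses lubridate's %m-% operator, which steps back whole months without month-boundary surprises:

# About two months before the reference date, clipped to a valid day if needed
beforedate <- three_days_ago %m-% months(2)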

and now the actual request (at last!)

gsc_all_queries <- search_analytics(hostname,
                                beforedate, three_days_ago,
                                c("query", "page"), rowLimit = 80000)

There is no point in asking for a longer time period: we want to know whether our web pages currently compete with one another.

rowLimit is a deliberately large round number that should be enough. If you have a popular website with a lot of long-tail traffic, you might need to increase it.

We are requesting the 'query' and 'page' dimensions. If you wish, it's possible to restrict the request to one type of user device, like 'desktop only'; see the search_analytics() function documentation.
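For example, here is a hedged sketch of a desktop-only request, using the dimensionFilterExp argument from the searchConsoleR documentation (double-check the exact filter syntax against your package version):

# Same request as above, hypothetically restricted to desktop traffic
gsc_desktop_queries <- search_analytics(hostname,
                                beforedate, three_days_ago,
                                c("query", "page"),
                                dimensionFilterExp = c("device==DESKTOP"),
                                rowLimit = 80000)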

The API response is stored inside the gsc_all_queries variable as a data frame.

If you happen to have several domains/subdomains that compete with each other for the same keywords, this process should be repeated for each property. The results will then have to be aggregated; the bind_rows function will help you bind them together. This is how to use it:

bind_rows(gsc_queries_1,gsc_queries_2)

step 3 – clean up

First, we'll filter out queries that are not on the first two SERPs and that don't generate any clicks. There is no point in making useless, time-consuming calculations.

We'll also remove branded search queries using a regex. As said earlier, having several positions for your brand name is pretty common and shouldn't be seen as a problem.

gsc_queries_filtered <- gsc_all_queries %>%
    filter(position <= 20) %>%   # keep the first two SERPs only
    filter(clicks != 0) %>%      # drop queries without any click
    filter(!str_detect(query, 'brandname|brand name'))   # drop branded queries

update this with your brand name

step 4 – computations

For each query, we want to know what percentage of clicks goes to each landing page.

First, we'll create a new clicksT column with the aggregated number of clicks for each search query. Then we'll use this value to calculate what we need inside a new per column.

gsc_queries_computed <- gsc_queries_filtered %>%
                                group_by(query) %>%
                                mutate(clicksT = sum(clicks)) %>%   # total clicks per query
                                group_by(page, add = TRUE) %>%
                                mutate(per = round(100 * clicks / clicksT, 2))   # share of the query's clicks per page

View(gsc_queries_computed)

A per column value of 100 means that all clicks go to the same URL.

As a final step, we'll sort the rows:

gsc_queries_final <- gsc_queries_computed %>%
                                arrange(desc(clicksT))

It could also make sense to remove rows where cannibalization is not significant, i.e. where the per column value is not very high.
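For example, here is a minimal sketch; the 5% cut-off and the "more than one page left" rule are arbitrary choices of mine, not part of the original workflow:

# Drop pages that only get a tiny share of a query's clicks,
# then keep only queries that still have several landing pages
gsc_queries_final <- gsc_queries_final %>%
    filter(per >= 5) %>%
    group_by(query) %>%
    filter(n() > 1) %>%
    ungroup()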

Now let's remove the columns we no longer need: clicks, impressions and the total clicks per query group (clicksT).

gsc_queries_final <- gsc_queries_final[, c(-3, -4, -7)]
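Alternatively, here is a sketch that selects columns by name instead of by position; run it instead of the line above, since column positions can shift if you change the requested dimensions:

# Same result, but robust to column reordering
gsc_queries_final <- gsc_queries_final %>%
    select(-clicks, -impressions, -clicksT)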

Now it's your choice: display it inside RStudio

View(gsc_queries_final)

Or write a CSV file to open it elsewhere

write.csv(gsc_queries_final,"./gsc_queries_final.csv")
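Or, if you prefer Excel, a quick sketch of mine using the writexl package (not covered in this chapter):

install.packages("writexl")
writexl::write_xlsx(gsc_queries_final, "./gsc_queries_final.xlsx")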

Here is my RStudio view (anonymized, sorry 🙊)

step 5 – analysis

You should check the data inside each "query pack". Everything is sorted by the total number of clicks, so the first rows are critical, the bottom rows not so much.

To help you deal with this, let's check the first ones.

For search query 1: 97% of clicks go to the same page. There is no keyword cannibalization here. It's interesting to notice that the 'second' landing page only earns 1.4% of clicks, even though it has an average position of 1.5. Users really don't like that second landing page; its metadata probably sucks.

Check that the first landing page is the right one, then move on.

For search query 2: 63% of clicks go to the first landing page and 36% to the second. This is keyword cannibalization. It could make sense to adapt the internal linking between the involved landing pages to influence which one ranks ahead of the other, depending on your goals, page bounce rates, etc.

And so on…
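If you have a lot of query packs to review, here is a small helper sketch of mine (names and the "more than one page" filter are illustrative, not from the original post): one row per query, to spot the worst offenders before digging into individual rows.

gsc_queries_overview <- gsc_queries_computed %>%
    group_by(query) %>%
    summarise(pages = n_distinct(page),        # how many landing pages get clicks
              total_clicks = first(clicksT),   # total clicks for the query
              top_share = max(per)) %>%        # share of the dominant page
    filter(pages > 1) %>%
    arrange(desc(total_clicks))

View(gsc_queries_overview)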

That's it, I hope you'll find it useful.
