Download and check XML sitemaps using R


If you are coming from Google, don't know anything about R, and just want to download an XML sitemap, use this tool: https://gokam.shinyapps.io/xsitemap/. If you want to learn how to do it yourself, keep on reading.

It's not required to submit an XML sitemap to have a successful website, but it's definitely an SEO nice-to-have.

Nevertheless, if you do submit one, it's best to make sure it's error-free, and as you will see, it is quite straightforward to extract its URLs using R.

Install the xsitemap R package (to be done once) and load it

# Install devtools (once), then use it to install
# the xsitemap package from GitHub, and load both
install.packages("devtools")
library(devtools)
install_github("pixgarden/xsitemap")
library(xsitemap)

Find and fetch XML sitemaps

xsitemap_urls <- xsitemapGet("https://www.nationalarchives.gov.uk/")

This function will first search for the XML sitemap URL: it checks the robots.txt file to see if an XML sitemap URL is explicitly declared there.

If not, the script will try a few common guesses ('sitemap.xml', 'sitemap_index.xml', ...); most of the time, it will find the XML sitemap URL.
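If you're curious what that robots.txt lookup can look like, here is a minimal sketch in base R. It is not the package's internal code, and whether this particular robots.txt actually declares a sitemap is not guaranteed; the URL is simply the example used above.

# Minimal sketch (not the package's code): read robots.txt and
# pull out any explicitly declared "Sitemap:" lines
robots_lines <- readLines("https://www.nationalarchives.gov.uk/robots.txt", warn = FALSE)
sitemap_lines <- grep("^\\s*sitemap:", robots_lines, ignore.case = TRUE, value = TRUE)
trimws(sub("^\\s*sitemap:\\s*", "", sitemap_lines, ignore.case = TRUE))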

Then, the XML sitemap URL is fetched and the URLs extracted.

If it's a classic XML sitemap, a data frame (a table-like R object) will be produced and returned.

If it's an XML sitemap index, the process starts over for every XML sitemap listed inside it.

Either way, this produces a data frame with all the extracted information.

View(xsitemap_urls)
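If you prefer to stay in the console rather than use RStudio's viewer, the usual base R helpers work on the returned data frame:

# Inspect the structure and the first rows of the result
str(xsitemap_urls)
head(xsitemap_urls)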

Check URLs HTTP code

Another interesting function lets you crawl the sitemap URLs and verify whether your web pages return proper 200 HTTP codes, using HEAD requests (which are easier on the website server).

xsitemap_urls_http <- xsitemapCheckHTTP(xsitemap_urls)
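To illustrate what a HEAD request is, here is a short sketch using the httr package. This is an assumption on my side, not something the chapter requires: xsitemapCheckHTTP already handles the requests for you.

library(httr)

# A HEAD request only returns the response headers, not the page body,
# which is lighter on the website server
resp <- HEAD("https://www.nationalarchives.gov.uk/")
status_code(resp)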

It will add a dedicated column with the HTTP code filled in. It can take some time depending on the number of URLs: it took several hours for https://www.gov.uk/, for example. You can check the data inside RStudio by using

View(xsitemap_urls_http)

or, if you prefer, generate a CSV (see the "Send and read SEO data to Excel/CSV" chapter).
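The CSV route can be as simple as one line of base R; the file name below is just an example:

# Export the checked URLs to a CSV file in the working directory
write.csv(xsitemap_urls_http, "sitemap-http-check.csv", row.names = FALSE)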

Count HTTP codes

Like in the intro, it's quite easy to count HTTP codes:

View(table(xsitemap_urls_http$http))

only to discover, at the time of writing, that most of the XML sitemap URLs are actually redirects...
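A natural follow-up is to isolate the entries that don't answer 200, so they can be fixed or dropped from the sitemap. Something like this should do it, relying on the same 'http' column used above:

# Keep only the sitemap entries that do not return a 200 code
not_ok <- subset(xsitemap_urls_http, http != 200)
View(not_ok)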

Plot the years the pages were added

You might have noticed that this XML sitemap comes with a "lastmod" field. This is an optional field that explicitly declares each page's last modification date to Google, which theoretically allows Google to optimise how it crawls the website.

It also allows us to understand how fresh a website's content is, since we can plot it (I've got help from the esquisse library to build the ggplot2 code below).


library(ggplot2)

# Histogram of the 'lastmod' dates found in the XML sitemap
ggplot(xsitemap_urls) +
  aes(x = lastmod) +
  geom_histogram(bins = 90L, fill = "#112446") +
  theme_minimal()

Let's try to get a clearer picture by extracting the years:

# Extract the year from the lastmod date and
# store the value in a new column called 'year'
xsitemap_urls$year <- format(xsitemap_urls$lastmod, "%Y")

# dplyr provides filter() and the %>% pipe used below
library(dplyr)

# Remove missing values (NA's)
# and plot the number of URLs by year
xsitemap_urls %>%
  filter(!is.na(year)) %>%
  ggplot() +
  aes(x = year) +
  geom_bar(fill = "#112446") +
  theme_minimal()

Most of the content dates from 2014-2015, and the oldest pages were last updated in 2001. If you prefer a cumulative percentage view:

# Convert the year column to numeric for ecdf()
plot(ecdf(as.numeric(xsitemap_urls$year)))
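Or, if you'd rather read numbers than a plot, a quick cumulative breakdown per year in base R:

# Cumulative percentage of sitemap URLs per lastmod year
year_counts <- table(xsitemap_urls$year)
round(cumsum(year_counts) / sum(year_counts) * 100, 1)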
