This function first searches for the XML sitemap URL. It starts by checking the robots.txt file to see if an XML sitemap URL is explicitly declared there.
If not, the script tries a few common guesses (‘sitemap.xml’, ‘sitemap_index.xml’, …); most of the time, this is enough to find the XML sitemap URL.
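For illustration, the robots.txt check and the fallback guesses could be sketched in R with httr as follows; the find_sitemap_url helper and the exact list of guessed paths are assumptions for this sketch, not the package's actual code.

library(httr)

find_sitemap_url <- function(site) {
  # 1. Check robots.txt for an explicit "Sitemap:" declaration
  robots <- GET(paste0(site, "robots.txt"))
  if (status_code(robots) == 200) {
    lines <- strsplit(content(robots, as = "text", encoding = "UTF-8"), "\n")[[1]]
    declared <- grep("^\\s*sitemap:", lines, ignore.case = TRUE, value = TRUE)
    if (length(declared) > 0) {
      return(trimws(sub("^\\s*sitemap:\\s*", "", declared[1], ignore.case = TRUE)))
    }
  }
  # 2. Otherwise, try a few common locations
  for (guess in c("sitemap.xml", "sitemap_index.xml")) {
    candidate <- paste0(site, guess)
    if (status_code(HEAD(candidate)) == 200) return(candidate)
  }
  NA_character_  # nothing found
}

find_sitemap_url("https://www.gov.uk/")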
Then, the XML sitemap URL is fetched and its URLs are extracted.
If it's a classic XML sitemap, a data frame (a special kind of array) is produced and returned.
If it's an XML sitemap index, the whole process starts again for every XML sitemap listed inside it.
The result is a data frame with all the extracted information.
View(xsitemap_urls)
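For a sense of what the extraction step involves under the hood, pulling the <loc> and <lastmod> entries out of a sitemap with xml2 might look roughly like this (a sketch, not the package's code; for an index sitemap, each child sitemap would be fetched and parsed the same way):

library(xml2)

parse_sitemap <- function(sitemap_url) {
  doc  <- read_xml(sitemap_url)
  ns   <- xml_ns(doc)                        # sitemap namespace, usually mapped to "d1"
  urls <- xml_find_all(doc, ".//d1:url", ns)
  data.frame(
    loc     = xml_text(xml_find_first(urls, "./d1:loc", ns)),
    lastmod = xml_text(xml_find_first(urls, "./d1:lastmod", ns)),  # NA when absent
    stringsAsFactors = FALSE
  )
}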
Check the URLs' HTTP codes
Another interesting function lets you crawl the sitemap URLs and verify whether your web pages return proper 200 HTTP codes, using HEAD requests (which are easier on the web server).
This can take some time depending on the number of URLs; it took several hours for https://www.gov.uk/, for example.
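The technique can be sketched with httr's HEAD() as follows; the check_http helper and the loc column name are assumptions for this illustration, while the package's own function does the equivalent work and produces the http column used below.

library(httr)

check_http <- function(urls) {
  # One HEAD request per URL; only the status code is kept
  vapply(urls, function(u) status_code(HEAD(u)), integer(1))
}

# Try it on a small sample first to stay gentle with the server:
# table(check_http(head(xsitemap_urls$loc, 50)))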
As in the intro, counting the HTTP codes is quite easy:
View(table(xsitemap_urls_http$http))
This reveals that, at the time of writing, most of the XML sitemap URLs are actually redirects...
Plot the years the pages were added
You might have noticed that this XML sitemap includes a "lastmod" field. This optional field explicitly declares a page's last modification date to Google, which in theory lets Google optimise how it crawls the website.
It also lets us see how fresh a website's content is, since we can plot it.
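A minimal sketch of such a plot in base R, assuming the data frame produced earlier exposes these values in a column named lastmod (the column name is an assumption here):

lastmod_year <- format(as.Date(xsitemap_urls$lastmod), "%Y")  # keep the year only
barplot(table(lastmod_year),
        main = "Sitemap URLs by lastmod year",
        xlab = "Year",
        ylab = "Number of URLs")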