Website Crawling and SEO extraction with Rcrawler
This section relies on Rcrawler, a package by Salim Khalil. It’s a very handy crawler with some nice functionalities.
Once your R environment is installed and launched, same as always, we’ll install and load our package:
To launch a simple website analysis, you only need this line of code:
It will crawl the entire website and provide you with the data.
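Here is a minimal sketch of these first steps, assuming the package installs from CRAN under the name Rcrawler. The URL is a placeholder and crawl_site() is a hypothetical helper of mine, used only so that nothing gets crawled until you call it yourself.

```r
# One-time setup (uncomment on first use):
# install.packages("Rcrawler")

# crawl_site() is a hypothetical wrapper: defining it crawls nothing.
crawl_site <- function(site) {
  # Rcrawler() creates its result objects (such as INDEX) in the
  # global environment once the crawl is finished.
  Rcrawler::Rcrawler(Website = site)
}

# Placeholder URL: replace it with the site you want to analyse.
# crawl_site("https://www.example.com/")
```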
Once the crawl is done, you’ll have access to:
The INDEX variable
To take a look at it, just run
Most of the columns are self-explanatory. Usually, the most interesting ones are ‘Http Resp’ and ‘Level’.
The Level is what SEOs call “crawl depth” or “page depth”. With it, you can easily check how far from the homepage some webpages are.
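To make this concrete, here is a small sketch of how these two columns can be used. The INDEX below is a mock with invented values and only three columns (the real one has more); I am also assuming the status codes are stored as text.

```r
# Mock INDEX limited to the columns used here (values are invented).
INDEX <- data.frame(
  Url = c("https://www.example.com/",
          "https://www.example.com/blog/",
          "https://www.example.com/blog/post-1",
          "https://www.example.com/old-page"),
  Level = c(0, 1, 2, 2),
  `Http Resp` = c("200", "200", "200", "404"),
  check.names = FALSE,        # keep the space in the "Http Resp" name
  stringsAsFactors = FALSE
)

# Pages that did not answer with a 200 status code.
broken <- INDEX[INDEX[["Http Resp"]] != "200", c("Url", "Http Resp", "Level")]

# Number of pages found at each crawl depth.
pages_per_level <- table(INDEX$Level)
```

On a real crawl, running View(INDEX) in RStudio or head(INDEX) in the console is enough to browse the table.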
HTML Files
By default, the Rcrawler function also stores the HTML files in your ‘working directory’. You can change that location by calling setwd() before the crawl.
Let’s go deeper into the options by answering the most common questions:
It’s possible to extract any element from the webpages using a CSS or XPath selector. We’ll need 2 new parameters:
PatternsNames to name the extracted fields
ExtractXpathPat or ExtractCSSPat to set where to grab them in the web page
Let’s take an example:
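A sketch of what such a call can look like, grabbing the h1 and the title of every page. The two XPath expressions, the URL and the crawl_with_extraction() wrapper are my own illustrative choices; the wrapper only exists so the snippet can be sourced without starting a crawl.

```r
# One XPath expression per field to extract...
xpaths <- c("//h1", "//title")
# ...and one name per expression, in the same order.
pattern_names <- c("h1", "title")

# Both vectors must have the same length.
stopifnot(length(xpaths) == length(pattern_names))

# Hypothetical wrapper: nothing is crawled until it is called.
crawl_with_extraction <- function(site) {
  Rcrawler::Rcrawler(
    Website = site,
    ExtractXpathPat = xpaths,
    PatternsNames = pattern_names
  )
}

# crawl_with_extraction("https://www.example.com/")  # placeholder URL
```

ExtractCSSPat works the same way with CSS selectors instead of XPath expressions.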
You can access the scraped data in two ways:
Option 1 = DATA. It’s an environment variable that you can access directly from the console. A small warning: it’s a ‘list’, which is a little less easy to read.
If you want to convert it to a data frame, which is easier to deal with, here is the code:
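Here is one way to do the conversion, shown on a mock DATA list so it can be run without a crawl. I am assuming each element holds a page ID followed by one value per pattern; if a pattern returns several values for a page, the rows will no longer line up.

```r
# Mock DATA: one element per page (values are invented).
DATA <- list(
  list(PageID = 1, h1 = "Home",  title = "Home | Example"),
  list(PageID = 2, h1 = "About", title = "About | Example")
)

# Flatten the list, then rebuild it row by row: one row per page.
NEWDATA <- data.frame(
  matrix(unlist(DATA), nrow = length(DATA), byrow = TRUE),
  stringsAsFactors = FALSE
)

# Give the columns readable names (the ID first, then the patterns).
names(NEWDATA) <- c("PageID", "h1", "title")
```

Note that unlist() turns everything into text, so PageID ends up as a character column.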
Option 2 = extracted_data.csv. It’s a CSV file that is saved inside your working directory, along with the HTML files.
It might be useful to merge the INDEX and NEWDATA data frames, here is the code:
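A sketch of the merge on mock data. A plain cbind(INDEX, NEWDATA) does the job when both tables have exactly the same rows in the same order; joining on the page ID, as below, is a more defensive variant of my own and assumes the ID columns are called Id and PageID.

```r
# Mock tables (values are invented); page 3 has no scraped data.
INDEX <- data.frame(
  Id = c("1", "2", "3"),
  Url = c("https://www.example.com/",
          "https://www.example.com/blog/",
          "https://www.example.com/logo.png"),
  stringsAsFactors = FALSE
)
NEWDATA <- data.frame(
  PageID = c("1", "2"),
  h1 = c("Home", "Blog"),
  stringsAsFactors = FALSE
)

# Keep every crawled URL; pages without scraped data get NA.
MERGED <- merge(INDEX, NEWDATA, by.x = "Id", by.y = "PageID", all.x = TRUE)
```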
As an example, let’s try to work out the type of each webpage using the scraped body class.
Let’s extract the first word and put it in a new column.
A little bit of cleaning to make the labels easier to read.
And then a quick ggplot
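The three steps above can be sketched like this. The body classes are invented and the "-template" suffix is just an example of noise to clean up; your own site will need its own rules. The plot is only built if ggplot2 is installed.

```r
# Mock data: one scraped body class per page (values are invented).
MERGED <- data.frame(
  Url = c("/", "/blog/post-1", "/blog/post-2", "/about"),
  body_class = c("home page-template", "post-template single",
                 "post-template single", "page-template default"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the first word of the class attribute.
MERGED$page_type <- sub(" .*$", "", MERGED$body_class)

# Step 2: clean the labels by dropping the "-template" suffix.
MERGED$page_type <- sub("-template$", "", MERGED$page_type)

# Step 3: count the pages per type and build a bar chart.
type_counts <- as.data.frame(table(page_type = MERGED$page_type),
                             responseName = "pages")
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(type_counts,
                       ggplot2::aes(x = page_type, y = pages)) +
    ggplot2::geom_col()
  # print(p)  # uncomment to display the chart
}
```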
Want to see something even cooler?
All the HTML files are stored on your hard drive, so if you need more data extracted, it’s entirely possible.
You can list your recent crawls by using the ListProjects() function.
First, we’re going to load the crawling project HTML files:
Let’s say you forgot to grab the h2’s and h3’s: you can extract them afterwards using ContentScraper(), which is also included in the Rcrawler package.
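A sketch of that second pass. rescrape_headings() is a hypothetical helper and the argument names follow my reading of the package documentation, so please check ?LoadHTMLFiles and ?ContentScraper before relying on it. The project name is whatever ListProjects() returns for your crawl.

```r
# Hypothetical helper: re-scrape h2 and h3 from a stored crawl.
rescrape_headings <- function(project_name) {
  # Load the stored HTML pages as a character vector, one page each.
  pages <- Rcrawler::LoadHTMLFiles(project_name, type = "vector")

  # Scrape each stored page; a page can hold several h2/h3,
  # hence ManyPerPattern = TRUE.
  lapply(pages, function(html) {
    Rcrawler::ContentScraper(
      HTmlText = html,
      XpathPatterns = c("//h2", "//h3"),
      PatternsName = c("h2", "h3"),
      ManyPerPattern = TRUE
    )
  })
}

# Rcrawler::ListProjects()                     # find the project name
# headings <- rescrape_headings("<project>")   # placeholder name
```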
For those not afraid of regex, here is a complimentary script to categorize URLs. Be careful: the order of the regexes is important, because later rules can overwrite earlier ones. Usually, it’s a good idea to place the home page last.
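A minimal base R sketch of the idea. The URLs, the patterns and the category names are invented for the example; the point is that each rule overwrites the previous ones, so the most specific rules go last.

```r
# Invented URLs for the example.
urls <- c(
  "https://www.example.com/",
  "https://www.example.com/blog/",
  "https://www.example.com/blog/my-post",
  "https://www.example.com/product/blue-widget",
  "https://www.example.com/contact"
)

# Start with a default label, then apply the rules in order.
category <- rep("other", length(urls))
category[grepl("/blog/", urls)] <- "blog"
category[grepl("/blog/.+", urls)] <- "blog post"   # overwrites "blog"
category[grepl("/product/", urls)] <- "product"
# Home page last: only the host, with an optional trailing slash.
category[grepl("^https?://[^/]+/?$", urls)] <- "homepage"

categorized <- data.frame(Url = urls, Category = category,
                          stringsAsFactors = FALSE)
```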
To make the crawler follow the robots.txt rules, just add the Obeyrobots parameter.
By default, this crawler is rather quick and can grab a lot of webpages in no time. Every advantage has its drawback: it’s fairly easy to be wrongly detected as a DoS attack. To limit the risk, I suggest you use the RequestsDelay parameter. It’s the time interval, in seconds, between each round of parallel HTTP requests.
Other interesting limitation options:
no_cores: the number of clusters (logical CPUs) used for parallel crawling; by default, it’s the number of available cores.
no_conn: the number of concurrent connections per core; by default, it takes the same value as no_cores.
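Put together, a gentler crawl could look like the sketch below. The values are arbitrary examples, not recommendations, and polite_crawl() is a hypothetical wrapper that keeps the snippet from crawling anything on its own.

```r
# Example settings (arbitrary values, tune them for your target site).
polite_settings <- list(
  RequestsDelay = 2,  # seconds between rounds of parallel requests
  no_cores = 2,       # parallel workers
  no_conn = 2         # concurrent connections per worker
)

# Hypothetical wrapper: nothing is crawled until it is called.
polite_crawl <- function(site, settings = polite_settings) {
  Rcrawler::Rcrawler(
    Website = site,
    Obeyrobots = TRUE,  # also respect robots.txt
    RequestsDelay = settings$RequestsDelay,
    no_cores = settings$no_cores,
    no_conn = settings$no_conn
  )
}

# polite_crawl("https://www.example.com/")  # placeholder URL
```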
Two parameters help you restrict a crawl to part of a website: crawlUrlfilter limits which URLs are crawled, and dataUrlfilter tells from which URLs data should be extracted.
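Both parameters take regular expressions. The sketch below uses invented patterns and URLs, and previews them with grepl(), which is only my approximation of how the crawler applies them. crawl_blog_only() is a hypothetical wrapper.

```r
# Invented patterns: stay inside /blog/, scrape only the posts.
crawl_filter <- "/blog/"
data_filter <- "/blog/.+"

# Preview of which URLs each pattern would match.
urls <- c(
  "https://www.example.com/blog/",
  "https://www.example.com/blog/my-post",
  "https://www.example.com/contact"
)
would_crawl <- urls[grepl(crawl_filter, urls)]
would_scrape <- urls[grepl(data_filter, urls)]

# Hypothetical wrapper: nothing is crawled until it is called.
crawl_blog_only <- function(site) {
  Rcrawler::Rcrawler(
    Website = site,
    crawlUrlfilter = crawl_filter,
    dataUrlfilter = data_filter
  )
}
```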
Option 1: use a VPN on your computer
Option 2: use a proxy
Use the httr package to set up a proxy and use it
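A sketch of the proxy setup. httr::use_proxy() builds the proxy configuration and, as far as I can tell from the documentation, Rcrawler() accepts it through its use_proxy argument. The host and port are yours to supply, and crawl_via_proxy() is a hypothetical wrapper.

```r
# Hypothetical wrapper: nothing is crawled until it is called.
crawl_via_proxy <- function(site, proxy_host, proxy_port) {
  # Build the proxy configuration with httr...
  proxy <- httr::use_proxy(proxy_host, proxy_port)
  # ...and hand it to the crawler.
  Rcrawler::Rcrawler(Website = site, use_proxy = proxy)
}

# Placeholder values, replace them with a proxy you are allowed to use:
# crawl_via_proxy("https://www.example.com/", "<proxy host>", 8080)
```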
Where to find a proxy? It’s been a while since I needed one, so I don’t know.
By default, Rcrawler doesn’t save internal links; you have to ask for them explicitly by using the NetworkData option, like this:
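For reference, the call can look like this; crawl_with_links() is a hypothetical wrapper and the URL is a placeholder.

```r
# Hypothetical wrapper: nothing is crawled until it is called.
crawl_with_links <- function(site) {
  # NetworkData = TRUE asks the crawler to also record the links
  # it finds between pages.
  Rcrawler::Rcrawler(Website = site, NetworkData = TRUE)
}

# crawl_with_links("https://www.example.com/")  # placeholder URL
```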
Then you’ll have two new variables available at the end of the crawling:
NetwIndex, which is simply the list of all the webpage URLs. The row numbers match the locally stored HTML files, so row n°1 = homepage = 1.html
NetwEdges, with all the links. It’s a bit confusing, so let me explain:
Weight is the depth level at which the link was discovered. All the first rows come from the homepage, so they are at Level 0. Type is either 1 for internal hyperlinks or 2 for external hyperlinks.
Count outbound links
To make it more readable let’s replace page IDs with URLs
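Here is a base R sketch of the outbound count on mock data. I am treating NetwIndex as a plain vector of URLs whose position is the page ID; if yours is a data frame, index its URL column instead. All values are invented.

```r
# Mock network data: 3 pages and 6 internal links.
NetwIndex <- c("https://www.example.com/",
               "https://www.example.com/blog/",
               "https://www.example.com/contact")
NetwEdges <- data.frame(
  From = c(1, 1, 1, 2, 2, 3),
  To = c(1, 2, 3, 1, 3, 1),
  Weight = c(0, 0, 0, 1, 1, 1),
  Type = 1
)

# Keep internal links only, then count the rows per source page.
internal <- NetwEdges[NetwEdges$Type == 1, ]
out_counts <- as.data.frame(table(From = internal$From),
                            responseName = "outlinks",
                            stringsAsFactors = FALSE)

# Replace the page IDs with the matching URLs.
out_counts$Url <- NetwIndex[as.integer(out_counts$From)]
```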
Count inbound links
The same thing but the other way around
Again to make it more readable
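The inbound version, again as a base R sketch on invented data with NetwIndex assumed to be a plain vector of URLs. The only change is that the links are grouped by the To column.

```r
# Mock network data: 3 pages and 6 internal links.
NetwIndex <- c("https://www.example.com/",
               "https://www.example.com/blog/",
               "https://www.example.com/contact")
NetwEdges <- data.frame(
  From = c(1, 1, 1, 2, 2, 3),
  To = c(1, 2, 3, 1, 3, 1),
  Type = 1
)

# Count the rows per target page.
in_counts <- as.data.frame(table(To = NetwEdges$To),
                           responseName = "inlinks",
                           stringsAsFactors = FALSE)

# Replace the page IDs with URLs and sort, most linked pages first.
in_counts$Url <- NetwIndex[as.integer(in_counts$To)]
in_counts <- in_counts[order(-in_counts$inlinks), ]
```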
Rcrawler handily includes PhantomJS, the classic headless browser. Here is how to use it:
After that, reference it as an option
It’s entirely possible to run 2 crawls, one with and one without rendering, and compare the data afterwards.
This Browser option can also be used with the other Rcrawler functions.
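As an illustration of the browser workflow with one of those other functions, here is a sketch using LinkExtractor(). The function and argument names follow my reading of the package documentation, so double check them with ?run_browser and ?LinkExtractor. render_links() is a hypothetical helper.

```r
# One-time setup (uncomment on first use): downloads PhantomJS.
# Rcrawler::install_browser()

# Hypothetical helper: fetch one page through the headless browser.
render_links <- function(page_url) {
  # Start a headless browser session...
  br <- Rcrawler::run_browser()
  # ...and make sure it is closed when the function exits.
  on.exit(Rcrawler::stop_browser(br), add = TRUE)

  # Fetch and parse the rendered page through that session.
  Rcrawler::LinkExtractor(url = page_url, Browser = br)
}

# render_links("https://www.example.com/")  # placeholder URL
```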
⚠️ Rendering a webpage means every JavaScript file will be run, including Web Analytics tags. If you don’t take the necessary precautions, it’ll change your Web Analytics data.
ref: Khalil, S., & Fakir, M. (2017). RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6, 98-106.
INDEX is a data frame; if you don’t know what a data frame is, it’s like an Excel spreadsheet. Please note that it will be overwritten by every new crawl, so copy it to another variable if you want to keep it!
Quick example with a website: let’s do a quick ‘ggplot’ and we’ll be able to see the pages
This is a static HTML file that can be stored anywhere, even on
NetwIndex data frame
NetwEdges data frame
Each row is a link. The From and To columns indicate from which page to which page each link goes. In the image above: row n°1 is a link from the homepage (page n°1) to the homepage; row n°2 is a link from the homepage to webpage n°2. According to the NetwIndex variable, page n°2 is an article, and so on.
I guess you guys are interested in counting links. Here is the code to do it. I won’t explain every step, as it would take too long. If you are interested (and motivated), go and read the documentation of the package used.
So the useless ‘‘ has 14 links pointing at it, as many as the homepage… Maybe I should fix this one day.
Rcrawler is a great tool, but it’s far from perfect. SEOs will definitely miss a couple of things: there is no internal dead links report, it doesn’t grab the nofollow attributes on links, and there are always a couple of bugs here and there. Overall, though, it’s a great tool to have. Another concern is that the project itself is quite inactive.
That’s it. I hope you found this article useful. Reach out to me for slow support, bugs/corrections or ideas for new articles. Take care.