Launch an R script using GitHub Actions
The easiest way to do that is to duplicate this repository on GitHub:
GitHub - pixgarden/scrape-automation: Scrape automation demo for https://www.rforseo.com/
Just push the "Fork" button to create your copy.
Let me explain how it works. It's basically all about two files:

sitemap_scraping.R

This is a classic R script. It fetches this website's XML sitemap and counts the number of URLs submitted. It relies on the rvest package (see the article about rvest).
```r
# Load libraries
library(tidyverse)
library(rvest)

# Declare the XML sitemap URL
url <- 'https://www.rforseo.com/sitemap.xml'

# Grab the HTML
url_html <- read_html(url)

# Select all the <loc>'s and count them
nbr_url <- url_html %>%
  html_nodes("loc") %>%
  length()

# Create a new row of data with today's date and the URL count
row <- data.frame(Sys.Date(), nbr_url)

# Append the new row at the end of the CSV
write_csv(row, 'data/xml_url_count.csv', append = TRUE)
```
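One quirk worth knowing: `write_csv(..., append = TRUE)` never writes a header row, so the accumulated CSV has no column names and you have to supply them yourself when reading it back. A minimal base-R sketch of the same append-then-read cycle (the `date` and `nbr_url` column names are my own labels here, not something stored in the file):

```r
# Simulate two daily appends into a temporary CSV (no header row).
tmp <- tempfile(fileext = ".csv")
row1 <- data.frame(Sys.Date(), 42L)
row2 <- data.frame(Sys.Date() + 1, 43L)
write.table(row1, tmp, sep = ",", col.names = FALSE, row.names = FALSE, append = TRUE)
write.table(row2, tmp, sep = ",", col.names = FALSE, row.names = FALSE, append = TRUE)

# Read it back, naming the columns explicitly since the file has none.
counts <- read.csv(tmp, header = FALSE, col.names = c("date", "nbr_url"))
print(counts)
```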

main.yml

This is where we are going to schedule the process.
```yaml
name: sitemap_scraping

# Controls when the action will run
on:
  schedule:
    - cron: '0 13 * * *'

jobs:
  autoscrape:
    # The type of runner that the job will run on
    runs-on: macos-latest

    # Load the repo and install R
    steps:
      - uses: actions/checkout@master
      - uses: r-lib/actions/setup-r@master

      # Install the packages the script needs
      - name: Install packages
        run: |
          R -e 'install.packages("tidyverse")'
          R -e 'install.packages("rvest")'

      # Run the R script
      - name: Scrape
        run: Rscript sitemap_scraping.R

      # Add new files in the data folder, commit along with other modified files, push
      - name: Commit files
        run: |
          git config --local user.name actions-user
          git config --local user.email "[email protected]"
          git add data/*
          git commit -am "GH ACTION Headlines $(date)"
          git push origin main
        env:
          REPO_KEY: ${{secrets.GITHUB_TOKEN}}
          username: github-actions
```
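A small hardening you may want for the "Commit files" step: `git commit` exits with a non-zero status when there is nothing to commit, which would mark the whole workflow run as failed. Guarding the commit keeps the step green either way. A sketch in a throwaway local repository (the scaffolding lines only exist to make the example self-contained; in the workflow you would keep just the `git add` and the `if` block):

```shell
#!/bin/sh
set -e
# --- demo scaffolding: a throwaway repo so the snippet runs anywhere ---
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "demo"
mkdir data
date +%Y-%m-%d > data/xml_url_count.csv

# --- the guarded commit itself ---
git add data/*
if git diff --cached --quiet; then
  echo "Nothing new to commit"
else
  git commit -qm "GH ACTION data refresh $(date)"
fi
```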
Parts you may want to modify are:
  • the execution frequency rule. It's the odd-looking line with cron; this one means "runs at 13:00 UTC every day". Here is the full syntax documentation.
  • the package list. If your script uses packages, you need to ask GitHub to install them before running the script, so be sure to include each of them in the "Install packages" step.
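For reference, the five cron fields read, left to right: minute, hour (in UTC), day of month, month, and day of week. A few alternative schedules you could drop in (the commented-out rules are illustrations of mine, not part of the original workflow):

```yaml
on:
  schedule:
    # ┌ minute ┬ hour (UTC) ┬ day of month ┬ month ┬ day of week
    - cron: '0 13 * * *'     # every day at 13:00 UTC (the rule used above)
    # - cron: '30 6 * * 1'   # every Monday at 06:30 UTC
    # - cron: '0 */6 * * *'  # every six hours
```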
The resulting CSV is updated every day and can be scraped:
scrape-automation/xml_url_count.csv at main · pixgarden/scrape-automation
