ralger is a package that aims to make web scraping in R as easy as possible. To scrape some data, you need only two elements: the link of the web page and the HTML or CSS node that references the information you need. Don’t panic, you don’t have to spend hours learning HTML and CSS. You can simply use the SelectorGadget Chrome extension. You can check out this tutorial for more information.


scrap()

Let’s dive into an example! Suppose we want to extract all Golden Globes Best Actress nominees (including the winner). In ralger you need only two elements: the link of the web page and the node that references the nominees’ names.

And that’s it, we’re ready to scrape!
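As a minimal sketch of that call, assuming scrap() takes the link and the node in that order: the URL below is only a placeholder (the real address of the nominees page is not reproduced here), the node is the same selector used with tidy_scrap() later in this post, and get_nominees() is a hypothetical wrapper added so that nothing is downloaded until you call it.

```r
# Placeholder URL (an assumption): replace it with the address of the
# Golden Globes nominees page you want to scrape.
link <- "https://example.com/best-actress-nominees"

# CSS selector for the nominee names, as found with SelectorGadget.
# It is the same selector passed to tidy_scrap() further down.
node <- ".primary-nominee a"

# Hypothetical wrapper: the page is only requested when the function is called.
get_nominees <- function(link, node) {
  ralger::scrap(link, node)
}

# With a real link, this gives a character vector of names:
# nominees <- get_nominees(link, node)
```

Keeping the link and the node in their own variables makes it easy to reuse them when we move on to several pages.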

2020 best actress winner


##  [1] "Renée Zellweger"    "Cynthia Erivo"      "Scarlett Johansson"
##  [4] "Saoirse Ronan"      "Charlize Theron"    "Glenn Close"       
##  [7] "Lady Gaga"          "Nicole Kidman"      "Melissa McCarthy"  
## [10] "Rosamund Pike"      "Frances McDormand"  "Meryl Streep"      
## [13] "Michelle Williams"  "Jessica Chastain"   "Sally Hawkins"     
## [16] "Isabelle Huppert"   "Amy Adams"          "Jessica Chastain"  
## [19] "Ruth Negga"         "Natalie Portman"

Pretty simple, right? I hope so. Anyway, the problem here is that the main page displays only 20 nominees, from 2017 to 2020. What if we wanted to extract all nominees in history? In that case, we’d have to go through multiple pages (20 to be exact) of the website. To do so, we need to use paste() in conjunction with scrap() as follows:
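Here is a rough sketch of that pattern. The base URL is a placeholder that is assumed to end right before the page number, the page range 0:20 is the one used with tidy_scrap() below, and count_nominees() is a hypothetical helper, so treat this as an illustration rather than the exact code behind the output.

```r
# Placeholder base URL (an assumption): it should end right before the page number.
link <- "https://example.com/best-actress-nominees?page="

# paste() is vectorised, so we get one link per page index.
# Note that 0:20 produces 21 indices (0, 1, ..., 20).
links <- paste(link, 0:20, sep = "")

length(links)  # 21
links[1]       # "https://example.com/best-actress-nominees?page=0"

# Same selector as in the single-page example.
node <- ".primary-nominee a"

# Hypothetical helper: scrape every page and count the extracted names.
# Nothing is downloaded until the function is called.
count_nominees <- function(links, node) {
  length(ralger::scrap(links, node))
}

# With real links:
# count_nominees(links, node)
```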

## [1] 349

And here we have our nominees, 349 in total! 😄


tidy_scrap()

Now, imagine that we need a data frame composed of two columns:

  • Actress: the names of the Golden Globes Best Actress nominees,
  • Movie: the title of the movie for which they were nominated.

To construct our data frame we’ll use the tidy_scrap() function as follows:

links <- paste(link, 0:20, sep = "") # The links required to extract all the observations (about 350)

nodes <- c(".primary-nominee a", ".secondary-nominee")

column_names <- c("Actress", "Movie")


global_df <- tidy_scrap(links, nodes, column_names)


head(global_df, n = 10)
## # A tibble: 10 x 2
##    Actress            Movie                   
##    <chr>              <chr>                   
##  1 Renée Zellweger    Judy                    
##  2 Cynthia Erivo      Harriet                 
##  3 Scarlett Johansson Marriage Story          
##  4 Saoirse Ronan      Little Women            
##  5 Charlize Theron    Bombshell               
##  6 Glenn Close        Wife, The               
##  7 Lady Gaga          Star Is Born, A (2018)  
##  8 Nicole Kidman      Destroyer               
##  9 Melissa McCarthy   Can You Ever Forgive Me?
## 10 Rosamund Pike      Private War, A

If you have any feedback, don’t hesitate to open a pull request or reach out on Twitter.