Scrapes the given HTML or CSS nodes from a web page and returns them as the columns of a tibble.
tidy_scrap(link, nodes, colnames, clean = FALSE, askRobot = FALSE)
Argument | Description |
---|---|
link | the URL of the web page to scrape. |
nodes | a vector of HTML or CSS selectors to extract; the SelectorGadget tool is highly recommended for finding them. |
colnames | the names of the expected columns, in the same order as nodes. |
clean | logical. Should the function clean the extracted tibble? Default is FALSE. |
askRobot | logical. Should the function check robots.txt to see whether scraping the web page is allowed? Default is FALSE. |
A tidy data frame (tibble) with one column per element of nodes, named according to colnames.
# Extracting IMDb movie titles and ratings
link <- "https://www.imdb.com/chart/top/"
my_nodes <- c(".titleColumn a", "strong")
names <- c("title", "rating")
tidy_scrap(link, my_nodes, names)
#> # A tibble: 250 x 2
#>    title                                             rating
#>    <chr>                                             <chr>
#>  1 The Shawshank Redemption                          9.2
#>  2 The Godfather                                     9.1
#>  3 The Godfather: Part II                            9.0
#>  4 The Dark Knight                                   9.0
#>  5 12 Angry Men                                      8.9
#>  6 Schindler's List                                  8.9
#>  7 The Lord of the Rings: The Return of the King     8.9
#>  8 Pulp Fiction                                      8.8
#>  9 Il buono, il brutto, il cattivo                   8.8
#> 10 The Lord of the Rings: The Fellowship of the Ring 8.8
#> # ... with 240 more rows
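
To make the relationship between nodes and colnames concrete, here is a minimal offline sketch of the same idea built directly on rvest and tibble. It is an illustration under stated assumptions, not necessarily how tidy_scrap is implemented: the inline HTML page, the movie names, and the variable names page, my_names, columns, and scraped are all made up for the example, and it assumes every selector matches the same number of elements.

```r
library(rvest)   # minimal_html(), html_elements(), html_text2()
library(tibble)  # as_tibble()

# A tiny inline page (made-up data) so the sketch runs without network access.
page <- minimal_html('
  <table>
    <tr><td class="titleColumn"><a>Movie A</a></td><td><strong>9.2</strong></td></tr>
    <tr><td class="titleColumn"><a>Movie B</a></td><td><strong>9.1</strong></td></tr>
  </table>')

my_nodes <- c(".titleColumn a", "strong")
my_names <- c("title", "rating")

# Extract one character vector per selector, in the order the selectors are given.
columns <- lapply(my_nodes, function(node) html_text2(html_elements(page, node)))
names(columns) <- my_names

# Assumption: all selectors matched the same number of elements,
# otherwise the columns cannot be bound into a rectangular tibble.
scraped <- as_tibble(columns)
print(scraped)
```

As in the output above, every column comes back as character, so numeric values such as ratings would still need an explicit conversion (for example with as.numeric()) before any arithmetic.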