Scrapes data from a web page and returns it as a tibble.
tidy_scrap(link, nodes, colnames, clean = FALSE, askRobot = FALSE)
link: the URL of the web page to scrape.
nodes: a vector of HTML or CSS selectors to extract. The SelectorGadget tool is highly recommended for finding them.
colnames: the names of the expected columns.
clean: logical. Should the function clean the extracted tibble? Default is FALSE.
askRobot: logical. Should the function check the site's robots.txt to see whether scraping the web page is allowed? Default is FALSE.
A tidy data frame (tibble).
# \donttest{
# Extract IMDb movie titles and ratings
link <- "https://www.imdb.com/chart/top/"
my_nodes <- c("a > h3.ipc-title__text", "span.ratingGroup--imdb-rating")
# Use a name that does not shadow base::names()
col_names <- c("title", "rating")
tidy_scrap(link, my_nodes, col_names)
#> # A tibble: 25 × 2
#> title rating
#> <chr> <chr>
#> 1 1. Die Verurteilten 9.3 (3.1M)
#> 2 2. Der Pate 9.2 (2.1M)
#> 3 3. The Dark Knight 9.0 (3M)
#> 4 4. Der Pate 2 9.0 (1.4M)
#> 5 5. Die zwölf Geschworenen 9.0 (936K)
#> 6 6. Der Herr der Ringe: Die Rückkehr des Königs 9.0 (2.1M)
#> 7 7. Schindlers Liste 9.0 (1.5M)
#> 8 8. Pulp Fiction 8.8 (2.4M)
#> 9 9. Der Herr der Ringe: Die Gefährten 8.9 (2.1M)
#> 10 10. Zwei glorreiche Halunken 8.8 (861K)
#> # ℹ 15 more rows
# }
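
Both columns in the output above come back as character vectors: the rank is fused into the title and the vote count is fused into the rating. The following is a minimal base R sketch of one way to split them apart after scraping. It is not part of the package. The `movies` data frame is a hand-copied stand-in for the first three rows shown above, and the `rank` and `votes` column names are illustrative assumptions.

```r
# Stand-in for the scraped result: first three rows copied from the output above
movies <- data.frame(
  title  = c("1. Die Verurteilten", "2. Der Pate", "3. The Dark Knight"),
  rating = c("9.3 (3.1M)", "9.2 (2.1M)", "9.0 (3M)"),
  stringsAsFactors = FALSE
)

# Pull the leading rank ("1.", "2.", ...) into its own integer column,
# then drop it from the title
movies$rank  <- as.integer(sub("^([0-9]+)\\..*$", "\\1", movies$title))
movies$title <- sub("^[0-9]+\\.[[:space:]]+", "", movies$title)

# Keep the text inside the parentheses as the vote count (still a string),
# then convert what precedes the parentheses to a numeric score
movies$votes  <- sub("^.*\\((.*)\\)$", "\\1", movies$rating)
movies$rating <- as.numeric(sub("[[:space:]]*\\(.*$", "", movies$rating))

str(movies)
```

The vote count is extracted before `rating` is overwritten, since both values are read from the same original string. The patterns assume the exact "N. Title" and "score (votes)" shapes seen in the output above, so they would need adjusting if the page layout changes.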