Website Tidy scraping — tidy_scrap

This function scrapes one or more elements from a web page and returns them as a tibble.

tidy_scrap(link, nodes, colnames, clean = FALSE, askRobot = FALSE)

Arguments

link: the URL of the web page to scrape.
nodes: a vector of HTML or CSS elements to extract. The SelectorGadget tool is highly recommended for finding them.
colnames: the names of the expected columns.
clean: logical. Should the function clean the extracted tibble? Default is FALSE.
askRobot: logical. Should the function check the site's robots.txt to see whether scraping the web page is allowed? Default is FALSE.

Value

A tidy data frame (tibble).

Examples

# \donttest{
# Extracting IMDb movie titles and ratings
link     <- "https://www.imdb.com/chart/top/"
my_nodes <- c("a > h3.ipc-title__text", "span.ratingGroup--imdb-rating")
names    <- c("title", "rating")
tidy_scrap(link, my_nodes, names)
#> # A tibble: 25 × 2
#>    title                                          rating    
#>    <chr>                                          <chr>     
#>  1 1. Die Verurteilten                            9.3 (3.1M)
#>  2 2. Der Pate                                    9.2 (2.1M)
#>  3 3. The Dark Knight                             9.0 (3M)  
#>  4 4. Der Pate 2                                  9.0 (1.4M)
#>  5 5. Die zwölf Geschworenen                      9.0 (936K)
#>  6 6. Der Herr der Ringe: Die Rückkehr des Königs 9.0 (2.1M)
#>  7 7. Schindlers Liste                            9.0 (1.5M)
#>  8 8. Pulp Fiction                                8.8 (2.4M)
#>  9 9. Der Herr der Ringe: Die Gefährten           8.9 (2.1M)
#> 10 10. Zwei glorreiche Halunken                   8.8 (861K)
#> # ℹ 15 more rows
# }
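
The scraped columns come back as character strings, so the rank is still attached to the title and the vote count to the rating. The following is a minimal sketch of one way to split them with base R. It is not part of tidy_scrap; the data frame `movies` and the new columns `rank` and `votes` are hypothetical names, and the sample rows are copied from the output above so that the snippet runs without a network connection.

```r
# Sample rows copied from the output above; in practice this would be
# the tibble returned by tidy_scrap().
movies <- data.frame(
  title  = c("1. Die Verurteilten", "2. Der Pate", "3. The Dark Knight"),
  rating = c("9.3 (3.1M)", "9.2 (2.1M)", "9.0 (3M)"),
  stringsAsFactors = FALSE
)

# Pull the leading rank ("1.") out of the title, then strip it.
movies$rank  <- as.integer(sub("^([0-9]+)\\..*$", "\\1", movies$title))
movies$title <- sub("^[0-9]+\\. *", "", movies$title)

# Keep the vote count found inside the parentheses as text (e.g. "3.1M"),
# then convert the score in front of it to a number.
movies$votes  <- sub("^.*\\((.*)\\) *$", "\\1", movies$rating)
movies$rating <- as.numeric(sub(" *\\(.*$", "", movies$rating))

movies
```

The regular expressions assume the "rank. title" and "score (votes)" layout seen in the output above; if the page markup changes, they would need adjusting.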