Scrapes the titles (h1, h2 and h3 HTML tags) from a web page. Useful for collecting the headlines of daily online newspapers.
titles_scrap(link, contain = NULL, case_sensitive = FALSE, askRobot = FALSE)
link: the link of the web page to scrape.
contain: an optional character string used to filter the titles; only titles containing it are returned. Defaults to NULL (no filtering).
case_sensitive: logical. Should the contain argument be case sensitive? Defaults to FALSE.
askRobot: logical. Should the function check the site's robots.txt to see whether scraping the web page is allowed? Defaults to FALSE.
a character vector containing the scraped titles.
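To make the interaction between contain and case_sensitive concrete, here is a minimal base R sketch that applies the same kind of filter to a character vector that has already been scraped. The helper `filter_titles()` is hypothetical and is not part of the package, and it assumes that contain is matched as a literal substring; the package itself may treat it as a regular expression.

```r
# Hypothetical helper, NOT the package's implementation: it only mimics the
# documented behaviour of `contain` and `case_sensitive`.
# Assumption: `contain` is matched as a literal substring.
filter_titles <- function(titles, contain = NULL, case_sensitive = FALSE) {
  # No filter requested: return every title unchanged
  if (is.null(contain)) {
    return(titles)
  }
  if (case_sensitive) {
    keep <- grepl(contain, titles, fixed = TRUE)
  } else {
    # Lower-case both sides so the literal match ignores case
    keep <- grepl(tolower(contain), tolower(titles), fixed = TRUE)
  }
  titles[keep]
}

titles <- c("Top Stories", "More News", "top picks", "Podcasts")

filter_titles(titles, contain = "top")
#> [1] "Top Stories" "top picks"

filter_titles(titles, contain = "top", case_sensitive = TRUE)
#> [1] "top picks"
```

With the default case_sensitive = FALSE, both "Top Stories" and "top picks" are kept; setting it to TRUE keeps only the title that matches the exact case.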
# Extracting the current titles of the New York Times
link <- "https://www.nytimes.com/"
titles_scrap(link)
#> [1] "New York Times - Top Stories" "What to Watch and Read"
#> [3] "More News" "The AthleticSports coverage"
#> [5] "Well" "Culture and Lifestyle"
#> [7] "AudioPodcasts and narrated articles" "GamesDaily puzzles"
#> [9] "Site Index" "Site Information Navigation"
#> [11] "Sections" "Top Stories"
#> [13] "Newsletters" "Podcasts"
#> [15] "Sections" "Top Stories"
#> [17] "Newsletters" "Sections"
#> [19] "Top Stories" "Newsletters"
#> [21] "Podcasts" "Sections"
#> [23] "Recommendations" "Newsletters"
#> [25] "Podcasts" "Sections"
#> [27] "Columns" "Newsletters"
#> [29] "Podcasts" "Sections"
#> [31] "Topics" "Columnists"
#> [33] "Podcasts" "Audio"
#> [35] "Listen" "Featured"
#> [37] "Newsletters" "Games"
#> [39] "Play" "Community"
#> [41] "Newsletters" "Cooking"
#> [43] "Recipes" "Editors' Picks"
#> [45] "Newsletters" "Wirecutter"
#> [47] "Reviews" "The Best..."
#> [49] "Newsletters" "The Athletic"
#> [51] "Leagues" "Top Stories"
#> [53] "Newsletters" "Play"
#> [55] "Sections" "Top Stories"
#> [57] "Newsletters" "Podcasts"
#> [59] "Sections" "Top Stories"
#> [61] "Newsletters" "Sections"
#> [63] "Top Stories" "Newsletters"
#> [65] "Podcasts" "Sections"
#> [67] "Recommendations" "Newsletters"
#> [69] "Podcasts" "Sections"
#> [71] "Columns" "Newsletters"
#> [73] "Podcasts" "Sections"
#> [75] "Topics" "Columnists"
#> [77] "Podcasts" "Audio"
#> [79] "Listen" "Featured"
#> [81] "Newsletters" "Games"
#> [83] "Play" "Community"
#> [85] "Newsletters" "Cooking"
#> [87] "Recipes" "Editors' Picks"
#> [89] "Newsletters" "Wirecutter"
#> [91] "Reviews" "The Best..."
#> [93] "Newsletters" "The Athletic"
#> [95] "Leagues" "Top Stories"
#> [97] "Newsletters" "Play"
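The output above repeats many navigation headings ("Sections", "Top Stories", "Newsletters" and so on) because they appear several times in the page markup. Since the result is a plain character vector, base R is enough to tidy it. The sketch below uses a short hand-typed sample vector rather than a live scrape, so the values are illustrative only.

```r
# Illustrative sample standing in for the vector returned by the function
titles <- c("Sections", "Top Stories", "Newsletters", "Podcasts",
            "Sections", "Top Stories", "Newsletters")

# Keep the first occurrence of each title, in order of appearance
unique_titles <- unique(titles)
unique_titles
#> [1] "Sections"    "Top Stories" "Newsletters" "Podcasts"

# Count how often each title occurs
title_counts <- table(titles)
title_counts[["Sections"]]
#> [1] 2
```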