This function scrapes a web page for all links (<a> tags) and extracts both
the URLs and the link text.
Usage
scrape_link(url, sort_by = c("link", "link_text"))
Value
A tibble with two columns: link_text, containing the text of each
link, and link, containing the absolute URL of each link. By default the
tibble is sorted by link and then by link text; with sort_by = "link_text"
it is sorted by link text instead. Only unique links are included.
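As a rough illustration of how a tibble of this shape could be assembled, the following sketch uses rvest, xml2, tibble and dplyr. It is not the package's actual implementation: sketch_scrape_link, its page and base_url parameters, and the inline HTML are hypothetical, and treating "unique" as distinct (link_text, link) rows is an assumption.

```r
# Minimal sketch only; assumes rvest (>= 1.0.0), xml2, tibble and dplyr are installed.
# sketch_scrape_link() is a hypothetical helper, not the documented scrape_link().
sketch_scrape_link <- function(page, base_url, sort_by = c("link", "link_text")) {
  sort_by <- match.arg(sort_by)

  # Keep only <a> tags that actually carry an href attribute
  anchors <- rvest::html_elements(page, "a[href]")

  out <- tibble::tibble(
    link_text = rvest::html_text2(anchors),
    # Resolve relative hrefs against the page URL to get absolute links
    link = xml2::url_absolute(rvest::html_attr(anchors, "href"), base_url)
  )

  # Assumption: "unique" means distinct (link_text, link) rows
  out <- dplyr::distinct(out)

  if (sort_by == "link") {
    dplyr::arrange(out, link, link_text)
  } else {
    dplyr::arrange(out, link_text, link)
  }
}

# Inline HTML stands in for a downloaded page so the sketch runs offline
page <- rvest::minimal_html(
  '<a href="/b">Beta</a> <a href="https://example.com/a">Alpha</a> <a href="/b">Beta</a>'
)
res <- sketch_scrape_link(page, base_url = "https://example.com")
res
```

For a live page, rvest::read_html(url) could supply the page argument, with the same url passed as base_url.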
Examples
head(scrape_link(url = "https://github.com/tidyverse/dplyr"))
#> # A tibble: 6 × 2
#>   link_text          link                                                       
#>   <chr>              <chr>                                                      
#> 1 Acero              https://arrow.apache.org/docs/cpp/streaming_execution.html 
#> 2 arrow              https://arrow.apache.org/docs/r/                           
#> 3 dbplyr             https://dbplyr.tidyverse.org/                              
#> 4 Documentation      https://docs.github.com                                    
#> 5 Docs               https://docs.github.com/                                   
#> 6 Search syntax tips https://docs.github.com/search-github/github-code-search/u…
head(
  scrape_link(
    url = "https://github.com/tidyverse/dplyr", sort_by = "link_text"))
#> # A tibble: 6 × 2
#>   link_text          link                                                      
#>   <chr>              <chr>                                                     
#> 1 + 266 contributors https://github.com/tidyverse/dplyr/graphs/contributors    
#> 2 + 42 releases      https://github.com/tidyverse/dplyr/releases               
#> 3 .Rbuildignore      https://github.com/tidyverse/dplyr/blob/main/.Rbuildignore
#> 4 .github            https://github.com/tidyverse/dplyr/tree/main/.github      
#> 5 .gitignore         https://github.com/tidyverse/dplyr/blob/main/.gitignore   
#> 6 .vscode            https://github.com/tidyverse/dplyr/tree/main/.vscode      
# This will give an "Invalid url" error
try(scrape_link(url = "https://github50.com"))
#> Error in scrape_link(url = "https://github50.com") : 
#>   Invalid url
#> 
#> ----- Metadata -----
#> 
#> url [url]: <character>
#> https://github50.com
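try() only reports the error and moves on. To branch on a failed scrape in a script, tryCatch() is one option. The sketch below assumes the function signals an ordinary R error condition, as the output above suggests; safe_scrape and fake_scraper are hypothetical names, and the scraper is passed in as an argument so the example runs without network access.

```r
# Hypothetical wrapper: returns NULL instead of stopping when scraping fails
safe_scrape <- function(url, scraper) {
  tryCatch(
    scraper(url),
    error = function(e) {
      # Report the failure and fall back to NULL
      message("Could not scrape ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in that always fails, mimicking the "Invalid url" error shown above
fake_scraper <- function(url) stop("Invalid url")

result <- safe_scrape("https://github50.com", fake_scraper)
is.null(result)
```

With the package loaded, scrape_link itself could be passed as the scraper argument.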