This function scrapes a web page for all links (<a> tags) and extracts both the URLs and the link text.

Usage

scrape_link(url, sort_by = c("link", "link_text"))

Arguments

url

Character. The URL of the web page to scrape. This URL is also used to resolve relative links to absolute URLs if no <base> tag is found.

sort_by

Character vector of length 1 or 2 naming the columns to sort the output by: "link", "link_text", or both, in order of priority. The default is c("link", "link_text").

Value

A tibble with two columns: link_text, containing the text of each link, and link, containing the absolute URL of each link. The tibble is sorted by the columns given in sort_by (by default, by link and then by link_text), and only unique links are included.
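
To make the relative-link resolution and sorting described above concrete, here is a minimal sketch built on xml2. It is not the package's implementation: extract_links() is a hypothetical helper, it parses an inline HTML string instead of fetching a page, it returns a plain data frame rather than a tibble, and it assumes "unique" means unique (link_text, link) rows.

```r
library(xml2)

# Hypothetical helper for illustration only; not part of the package.
extract_links <- function(html, page_url, sort_by = c("link", "link_text")) {
  sort_by <- match.arg(sort_by, several.ok = TRUE)
  doc <- read_html(html)

  # Use the <base> tag's href when one exists; otherwise fall back to the page URL.
  base_href <- xml_attr(xml_find_first(doc, "//base[@href]"), "href")
  base_url <- if (is.na(base_href)) page_url else url_absolute(base_href, page_url)

  # Collect every <a> tag that has an href and resolve it against the base URL.
  anchors <- xml_find_all(doc, "//a[@href]")
  links <- data.frame(
    link_text = trimws(xml_text(anchors)),
    link = url_absolute(xml_attr(anchors, "href"), base_url),
    stringsAsFactors = FALSE
  )

  # Drop duplicate rows (assumption), then sort by the requested columns.
  links <- unique(links)
  links <- links[do.call(order, unname(as.list(links[sort_by]))), , drop = FALSE]
  rownames(links) <- NULL
  links
}

html <- '<html><head></head><body>
  <a href="/docs">Docs</a>
  <a href="news.html">News</a>
  <a href="news.html">News</a>
  <a href="https://other.org/">Other</a>
</body></html>'

res <- extract_links(html, page_url = "https://example.com/pkg/")
print(res)
```

In this sketch, "/docs" resolves against the host root, "news.html" resolves against the page's directory, the duplicate "News" link is collapsed, and the absolute URL is left unchanged.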

Examples

head(scrape_link(url = "https://github.com/tidyverse/dplyr"))
#> # A tibble: 6 × 2
#>   link_text       link                                                 
#>   <chr>           <chr>                                                
#> 1 Archive Program https://archiveprogram.github.com                    
#> 2 Acero           https://arrow.apache.org/docs/cpp/acero/overview.html
#> 3 arrow           https://arrow.apache.org/docs/r/                     
#> 4 dbplyr          https://dbplyr.tidyverse.org/                        
#> 5 Documentation   https://docs.github.com                              
#> 6 Docs            https://docs.github.com/                             

head(scrape_link(
  url = "https://github.com/tidyverse/dplyr",
  sort_by = "link_text"
))
#> # A tibble: 6 × 2
#>   link_text     link                                                      
#>   <chr>         <chr>                                                     
#> 1 + 44 releases https://github.com/tidyverse/dplyr/releases               
#> 2 .Rbuildignore https://github.com/tidyverse/dplyr/blob/main/.Rbuildignore
#> 3 .claude       https://github.com/tidyverse/dplyr/tree/main/.claude      
#> 4 .github       https://github.com/tidyverse/dplyr/tree/main/.github      
#> 5 .gitignore    https://github.com/tidyverse/dplyr/blob/main/.gitignore   
#> 6 .vscode       https://github.com/tidyverse/dplyr/tree/main/.vscode      

# This will give an "Invalid url" error
try(scrape_link(url = "https://github50.com"))
#> Error in scrape_link(url = "https://github50.com") : 
#>   Invalid url
#> 
#> ----- Metadata -----
#> 
#> url [url]: <character>
#> https://github50.com