
So I have a website, https://ais.sbarc.org/logs_delimited/, which has a bunch of links, and within each link are 24 links to .txt files.

I'm new to R, but I'm able to loop through one link to get the 24 text files into a dataframe. But I can't figure out how to loop the whole directory.

I was able to loop over the 24 links using hours.list, but year.list and trip.list wouldn't work. I apologize if this is similar to other web-scraping questions or if I'm missing something really simple, but I'd appreciate any help.

library(readr)
library(magrittr)

get_ais_text = function(ais_text){

  hours.list = c(0:23)
  hours.list_1 = sprintf('%02d', hours.list)

  year.list = c(2018:2022)
  year.list_1 = sprintf('%d', year.list)

  trip.list = c(190101:191016)
  trip.list_1 = sprintf("%d", trip.list)

  # ais_col_types is defined elsewhere in my script
  ais_text = tryCatch(
    lapply(paste0('https://ais.sbarc.org/logs_delimited/2019/190101/AIS_SBARC_190101-', hours.list_1, '.txt'),
           function(url){
             url %>%
               read_delim(";", col_names = sprintf("X%d", 1:25), col_types = ais_col_types)
           }),
    error = function(e){NA}
  )
  DF = do.call(rbind.data.frame, ais_text)
  return(DF)
}

get_ais_text()
SeaGo

2 Answers


Here's a function that works recursively to get all the links starting with the home directory. Note that it takes a bit to run:

library(xml2)
library(magrittr)

.get_link <- function(u){
  node  <- xml2::read_html(u)
  # all <a> links except the "../" parent-directory link
  hrefs <- xml2::xml_find_all(node, ".//a[not(contains(@href,'../'))]") %>% xml_attr("href")
  urls  <- xml2::url_absolute(hrefs, xml_url(node))
  if(!all(tools::file_ext(urls) == "txt")){
    # still in a directory listing: recurse into each link
    lapply(urls, .get_link)
  } else {
    # reached the .txt files: return them
    return(urls)
  }
}

What this is doing is basically starting with a URL, reading its contents, and finding any links (<a> tags) using an XPath selector that says "all links whose href does not contain ../", i.e. excluding the link back to the parent directory. If a link leads to more links, it loops through and gets all of those as well; once we reach the final links, i.e. the .txt files, we're done.
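For intuition, here's a quick sketch of what that XPath picks up on a single day directory (using the 2019/190101 folder from the question as an example):

# Sketch: links found on one day's directory page
node  <- xml2::read_html("https://ais.sbarc.org/logs_delimited/2019/190101/")
# every <a> except the "../" back link
hrefs <- xml2::xml_find_all(node, ".//a[not(contains(@href,'../'))]") %>% xml_attr("href")
head(hrefs)  # relative links such as "AIS_SBARC_190101-00.txt"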

Example, cheating a bit and starting only at 2018:

a <- .get_link("https://ais.sbarc.org/logs_delimited/2018/")
> a[[1]][1:2]
[1] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-00.txt"
[2] "https://ais.sbarc.org/logs_delimited/2018/180101/AIS_SBARC_180101-01.txt"
> length(a)
[1] 365
> a[[365]][1:2]
[1] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-00.txt"
[2] "https://ais.sbarc.org/logs_delimited/2018/181231/AIS_SBARC_181231-01.txt"

What you would do is simply start with https://ais.sbarc.org/logs_delimited/ as the url input, and then add something like data.table::fread to digest the data, which I would suggest doing in a separate iteration. Something like this works:

lapply(1:length(a), function(i){
    lapply(a[[i]], data.table::fread)
})
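
If the nested list-of-lists gets awkward (it causes an fread hiccup mentioned in the comments below), flattening first is simpler:

# Flatten the nested list of URLs, then read each file
# (same idea as the `lapply(unlist(your_list), fread)` fix in the comments)
txt_urls <- unlist(a)
dt_list  <- lapply(txt_urls, data.table::fread)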

For reading in data...

First thing to notice here is that there are 11,636 files. That's a lot of links to hit on someone's server at once, so I'm going to sample a few and show how to do it. I would suggest adding a Sys.sleep call into yours.
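For example, a minimal sketch of a polite reader (the one-second pause and the helper name are arbitrary choices):

# Pause between requests so we don't hammer the server
read_politely <- function(url, pause = 1) {
  Sys.sleep(pause)                      # wait before each download
  data.table::fread(url, sep = ";")
}
# then: ais_list <- lapply(b, read_politely)  # b is the de-duplicated url vector built below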

# This gets all the urls
a <- .get_link("https://ais.sbarc.org/logs_delimited/")
# This unlists and gives us a unique array of the urls
b <- unique(unlist(a))
# I'm sampling b, but you would just use `b` instead of `b[...]`
a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
    df <- data.table::fread(i, sep = ";") %>% as.data.frame()
    # Giving the file path for debug later if needed seems helpful
    df$file_path <- i
    df
}))

> a_dfs %>% head()
  17:00:00:165              24  0 338179477 LAUREN SEA        V8 V9   V15 V16 V17 V18 V19 V20 V21 V22 V23                                                                file_path   V1   V2 V3 V4
1 17:00:00:166     EUPHONY ACE 79     71.08          1 371618000  0 254.0 253  52   0   0   0   0   5  NA https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
2 17:00:01:607 SIMONE T BRUSCO 31     32.93          3 367593050 15 255.7  97  55   0   0   1   0 503   0 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
3 17:00:01:626 POLARIS VOYAGER 89    148.80          1 311000112  0 150.0 151  53   0   0   0   0   0  22 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
4 17:00:01:631         SPECTRE 60     25.31          1 367315630  5 265.1 511  55   0   0   1   0   2  20 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
5 17:00:01:650          KEN EI 70     73.97          1 354162000  0 269.0 269  38   0   0   0   0   1  84 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
6 17:00:02:866 HANNOVER BRIDGE 70     62.17          1 372104000  0 301.1 300  56   0   0   0   0   3   1 https://ais.sbarc.org/logs_delimited/2018/180113/AIS_SBARC_180113-17.txt <NA> <NA> NA NA
  V5 V6 V7 V10 V11 V12 V13 V14 02:00:00:489 338115994  1 37 SRTG0$ 10  7  4 17:00:00:798 BROADBILL 16.84 269   18 367077090 16.3 -119.981493 34.402530 264.3 511 40
1 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
2 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
3 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
4 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
5 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA
6 NA NA NA  NA  NA  NA  NA  NA         <NA>        NA NA NA     NA NA NA NA         <NA>      <NA>    NA  NA <NA>      <NA>   NA          NA        NA    NA  NA NA

Obviously there's some cleaning to do, but this is how you'd get to it, I think.
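The scrambled column names above most likely come from fread guessing a header row from the first line of each file; forcing header = FALSE is one way around that (a quick sketch, not tested against every file):

# Sketch: read one file with no header row and pad ragged lines
df <- data.table::fread(b[1], sep = ";", header = FALSE, fill = TRUE) %>%
    as.data.frame()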

Edit 2

I actually like this better: read the data in, then split the strings and build the data frame directly:

a_dfs <- jsonlite::rbind_pages(lapply(b[sample(1:length(b), 20)], function(i){
    raw <- readLines(i)
    # split each line on ";" into a character matrix
    str_matrix <- stringi::stri_split_regex(raw, "\\;", simplify = TRUE)
    # turn empty strings into NA, then build the data frame
    as.data.frame(apply(str_matrix, 2, function(j){
        ifelse(!nchar(j), NA, j)
    })) %>% dplyr::mutate(file_name = i)
}))

> a_dfs %>% head
            V1           V2 V3    V4    V5 V6 V7        V8 V9 V10  V11 V12         V13       V14   V15 V16 V17 V18 V19 V20 V21 V22  V23  V24  V25
1 09:59:57:746    STAR CARE 77 75.93   135  1  0 566341000  0   0 16.7   1 -118.839933 33.562167   321 322  50   0   0   0   0   6   19 <NA> <NA>
2 10:00:00:894     THALATTA 70 27.93 133.8  1  0 229710000  0 251 17.7   1 -119.366765 34.101742 283.9 282  55   0   0   0   0   7 <NA> <NA> <NA>
3 10:00:03:778   GULF GLORY 82 582.3   256  1  0 538007706  0   0 12.4   0 -129.345783 32.005983    87  86  54   0   0   0   0   2   20 <NA> <NA>
4 10:00:03:799    MAGPIE SW 70 68.59 123.4  1  0 352597000  0   0 10.9   0 -118.747970 33.789747 119.6 117  56   0   0   0   0   0   22 <NA> <NA>
5 10:00:09:152 CSL TECUMSEH 70 66.16 269.7  1  0 311056900  0  11   12   1 -120.846763 34.401482 105.8 106  56   0   0   0   0   6   21 <NA> <NA>
6 10:00:12:870    RANGER 85 60 31.39 117.9  1  0 367044250  0 128    0   1 -119.223133 34.162953   360 511  56   0   0   1   0   2   21 <NA> <NA>
                                                                 file_name  V26  V27
1 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
2 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
3 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
4 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
5 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
6 https://ais.sbarc.org/logs_delimited/2018/180211/AIS_SBARC_180211-10.txt <NA> <NA>
Carl Boneri
  • I nicked the `fread` part from you after unsuccessfully trying to make `readr` work (and after realising this was part of the question too). Very cool solution with `xml_find_all` as well. – JBGruber Oct 29 '19 at 23:44
  • As someone who has read the docs in the `curl` package a ton, I appreciated the reference to `crawler` – Carl Boneri Oct 29 '19 at 23:47
  • Isn't `crawler` common terminology in this case? – JBGruber Oct 29 '19 at 23:54
  • https://github.com/jeroen/curl/blob/master/examples/sitemap.R – Carl Boneri Oct 29 '19 at 23:56
  • Wow! I did see this not long ago, and I seriously question my own brain right now because I was not aware to what extent I just copied that function! :D – JBGruber Oct 30 '19 at 00:01
  • @SeaGo not a problem. If it works you should accept the answer. Happy parsing! – Carl Boneri Oct 30 '19 at 00:43
  • @CarlBoneri, Seriously, I can't thank you enough!!! However, the `lapply(1:length(a), function(i){ lapply(a[[i]], data.table::fread) })` is giving me an error, "Internal error: unexpected field of size 0" – SeaGo Oct 30 '19 at 01:05
  • Probably because it's a list of lists... honestly, do `lapply(unlist(your_list), fread)` – Carl Boneri Oct 30 '19 at 01:14
  • @CarlBoneri I can't seem to get it to work... But I really appreciate your answer! Thank you! – SeaGo Oct 30 '19 at 01:36
  • @carlboneri You sir are my hero! Thank you for the detailed webscrape assistance! The data is for tracking container ships and doing spatial analysis to monitor their speeds within endangered whale hotspots. So the whales thank you too! – SeaGo Oct 30 '19 at 02:44
  • Oh rad! I'd love to assist with that any way possible. So please let me know if there is anything else you need. – Carl Boneri Oct 30 '19 at 02:58
  • @CarlBoneri have you used rcurl to scrape just the new files like, https://stackoverflow.com/questions/22235421/using-r-to-download-newest-files-from-ftp-server#answers-header – SeaGo Oct 31 '19 at 21:48
  • @seago I sent you a message on another platform... Check messages – Carl Boneri Oct 31 '19 at 23:04

This works for me:

library(rvest)
library(xml2)  # for url_absolute()

crawler <- function(base_url) {

  # grab every link on a page except the "../" parent-directory link,
  # and turn relative hrefs into absolute URLs
  get_links <- function(url) {
    read_html(url) %>% 
      html_nodes("a") %>% 
      html_attr("href") %>% 
      grep("../", ., fixed = TRUE, value = TRUE, invert = TRUE) %>% 
      url_absolute(url)
  }

  links <- base_url
  counter <- 1

  # keep descending until every link ends in .txt
  while (sum(grepl("txt$", links)) != length(links)) {
    links <- unlist(lapply(links, get_links))
    message("scraping level ", counter, " [", length(links), " links]")
    counter <- counter + 1
  }

  return(links)

}

txts <- crawler("https://ais.sbarc.org/logs_delimited/")

It may look like it gets stuck on level 3, but that's just because there are so many links to go through.
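If you just want to test it without waiting, you can start one level down, like the other answer does with 2018:

# quicker test run: crawl a single year instead of the whole archive
txts_2018 <- crawler("https://ais.sbarc.org/logs_delimited/2018/")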

Once you have all the txt urls, you can use this to read in the files:

library(dplyr)
library(data.table)

df <- lapply(txts, fread, fill = TRUE) %>% 
  rbindlist() %>% 
  as_tibble()

I would definitely do this in two steps, as it will run for quite a while and it makes sense to save intermediate results (i.e., the links).
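A minimal sketch of that two-step idea (the file name is just a placeholder):

# step 1: scrape and save the links
saveRDS(txts, "ais_txt_links.rds")
# step 2 (possibly in a fresh session): reload them before reading the files
txts <- readRDS("ais_txt_links.rds")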

You can also try to run this in parallel if you want (cl is the number of cores to use):

library(pbapply)             

df <- pblapply(txts[1:10], fread, fill = TRUE, cl = 3) %>% 
  rbindlist() %>% 
  as_tibble()

Should be a little faster and you also get a nice progress bar.

JBGruber