
I have a data.frame (dim: 100 x 1) containing a list of URL links; each URL looks something like this: https:blah-blah-blah.com/item/123/index.do .

The list (a data.frame called my_list with 100 rows and a single character column named col, i.e. $ col: chr) looks like this:

 1 "https:blah-blah-blah.com/item/123/index.do"
 2 "https:blah-blah-blah.com/item/124/index.do"
 3 "https:blah-blah-blah.com/item/125/index.do"

etc.

I am trying to import each of these URLs into R and save them collectively as a single object that is compatible with text mining procedures.

I know how to convert each of the URLs on the list manually:

library(pdftools)
library(tidytext)
library(textrank)
library(dplyr)
library(tm)

#1st document
url <- "https:blah-blah-blah.com/item/123/index.do"

article <- pdf_text(url)

Once this "article" file has been successfully created, I can inspect it:

str(article)

chr [1:13]

It looks like this:

[1] "abc ....."
[2] "def ..."
etc etc
[13] "ghi ..."

From here, I can successfully save this as an RDS file:

saveRDS(article, file = "article_1.rds")

Is there a way to do this for all 100 articles at the same time? Maybe with a loop?

Something like :

for (i in 1:100) {
  url_i <- my_list[i, 1]
  article_i <- pdf_text(url_i)
  saveRDS(article_i, file = "article_i.rds")
}

If this was written correctly, it would save each article as an RDS file (e.g. article_1.rds, article_2.rds, ... article_100.rds).
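Note: as written above, file = "article_i.rds" is a literal string, so every iteration would overwrite the same file. A minimal sketch of the corrected loop, assuming my_list and pdf_text() as above, builds each filename with paste0():

```r
library(pdftools)

for (i in 1:100) {
  url_i <- my_list[i, 1]
  article_i <- pdf_text(url_i)
  # paste0() builds a distinct filename for each iteration:
  # article_1.rds, article_2.rds, ..., article_100.rds
  saveRDS(article_i, file = paste0("article_", i, ".rds"))
}
```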

Would it then be possible to save all these articles into a single rds file?

stats_noob

3 Answers


Please note that list is not a good name for an object, as it will mask the built-in list() function. It is usually good to name your variables according to their content. Maybe url_df would be a good name.

library(pdftools)
#> Using poppler version 20.09.0
library(tidyverse)

url_df <-
  data.frame(
    url = c(
      "https://www.nimh.nih.gov/health/publications/autism-spectrum-disorder/19-mh-8084-autismspecdisordr_152236.pdf",
      "https://www.nimh.nih.gov/health/publications/my-mental-health-do-i-need-help/20-mh-8134-mymentalhealth-508_161032.pdf"
    )
  )

Since the URLs are already in a data.frame, we can store the text data in an additional column. That way the data will be easily available for later steps.

text_df <- 
  url_df %>% 
  mutate(text = map(url, pdf_text))

Instead of saving each text in a separate file we can now store all of the data in a single file:

saveRDS(text_df, "text_df.rds")

For historical reasons, for loops are not very popular in the R community. Base R has the *apply() family of functions, which provides a functional approach to iteration. The tidyverse has the purrr package, whose map*() functions improve upon the *apply() functions.

I recommend taking a look at https://purrr.tidyverse.org/ to learn more.
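To get a feel for map() without touching the network, here is a toy sketch in which nchar() stands in for pdf_text() (the filenames are made up):

```r
library(purrr)

urls <- c("a.pdf", "bb.pdf", "ccc.pdf")

# map() returns a list, like lapply(); the typed variant map_int()
# returns a plain integer vector instead
lens <- map_int(urls, nchar)
lens
#> [1] 5 6 7
```

Swapping nchar for pdf_text (and map_int for map, since pdf_text returns a character vector per document) gives exactly the mutate() call shown above.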

Till
  • Each of these url's end in "index.do" - could this be the problem? – stats_noob Apr 09 '21 at 21:23
  • thank you for your answer! i tried your code and got the following errors: text_df <- url_df %>% mutate(text = map(url, pdf_text)) PDF error: May not be a PDF file (continuing anyway) PDF error (2): Illegal character <21> in hex string PDF error (4): Illegal character <4f> in hex string ..etc etc etc... PDF error: Couldn't find trailer dictionary PDF error: Couldn't read xref table Error: Problem with `mutate()` input `text`. x PDF parsing failure. i Input `text` is `map(url, pdf_text)`. – stats_noob Apr 09 '21 at 21:26
  • 1
    One or more of your urls is incorrect. Try encapsulating pdf_text in a function, printing url before you pdf_text it, so you can see which one is failing – Dennis Apr 10 '21 at 07:27
  • 1
    @Noob it doesn't matter what the url is, as long as it returns something that can be parsed by pdf_text – Dennis Apr 10 '21 at 07:29
  • 1
    Choosing names by describing the contents is good advice. But overriding existing names in a small scope is completely unproblematic, and thereʼs nothing wrong with that. (In R in particular you can even continue to use the `list` function even when youʼre using that name for a local variable!) – Konrad Rudolph Apr 10 '21 at 09:20
  • @KonradRudolph : I changed the name "list" to "my_list" – stats_noob Apr 10 '21 at 15:58
  • @Dennis : "Try encapsulating pdf_text in a function, printing url before you pdf_text it, so you can see which one is failing " ... can you please show me how to do this? – stats_noob Apr 10 '21 at 15:59
  • 1
    @Noob The point of my comment was precisely that this is unnecessary. And `my_list` in particular is *worse* than plain `list`: it’s a completely non-descriptive name. The prefix `my_` is pure visual clutter because it provides no useful information. – Konrad Rudolph Apr 10 '21 at 16:01

Say you have a data.frame called my_df with a column that contains the URLs of your PDF locations. As your comments show, some URLs lead to broken PDFs. You can use tryCatch in these cases to report back which links were broken, and then check manually what's wrong with those links.

You can do this in a for loop like this:

my_df <- data.frame(url = c(
  "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # working pdf
  "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pfd" # broken pdf
))

# make some useful new columns
my_df$id <- seq_along(my_df$url)
my_df$status <- NA

for (i in my_df$id) {
  
  my_df$status[i] <- tryCatch({
    
    message("downloading ", i) # put a status message on screen
    article_i <- suppressMessages(pdftools::pdf_text(my_df$url[i]))
    saveRDS(article_i, file = paste0("article_", i, ".rds"))
    "OK"
    
  }, error = function(e) {return("FAILED")}) # return the string FAILED if something goes wrong
  
}
my_df$status
#> [1] "OK"     "FAILED"

I included a broken link in the example data on purpose to showcase how this would look.

Alternatively, you can use a loop from the apply family. The difference is that instead of iterating through a vector and applying the same code until the end of the vector, *apply takes a function, applies it to each element of a list (or of an object that can be coerced to a list), and returns the results of all iterations in one go. Many people find *apply functions confusing at first because functions are usually defined and applied in a single line. Let's make the function more explicit:

s_download_pdf <- function(link, id) {
  tryCatch({
    message("downloading ", id) # put a status message on screen
    article_i <- suppressMessages(pdftools::pdf_text(link))
    saveRDS(article_i, file = paste0("article_", id, ".rds"))
    "OK"
    
  }, error = function(e) {return("FAILED")})
}

Now that we have this function, let's use it to download all files. I'm using mapply which iterates through two vectors at once, in this case the id and url columns:

my_df$status <- mapply(s_download_pdf, link = my_df$url, id = my_df$id)
my_df$status
#> [1] "OK"     "FAILED"

I don't think it makes much of a difference which approach you choose as the speed will be bottlenecked by your internet connection instead of R. Just thought you might appreciate the comparison.
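On the follow-up question about a single file: once the per-article files exist, they can be collected into one object and saved together. A sketch, assuming the article_<id>.rds naming used above:

```r
# find all article files written by the loop
files <- list.files(pattern = "^article_\\d+\\.rds$")

# read each one back; the result is a list of character vectors,
# one element per article
all_articles <- lapply(files, readRDS)
names(all_articles) <- files

# one file holding every article
saveRDS(all_articles, "all_articles.rds")
```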

JBGruber
  • thank you for your answer! does this first require you to create an object called "articles"? – stats_noob Apr 09 '21 at 22:20
  • 1
    You just need a vector. I think you have a `data.frame` called `list` (don't use that name btw) with a column called `col`, which contains the links. In this case to get a vector, simply use `list$col`. You can use that instead of `article` and leave the rest of the code as is. – JBGruber Apr 10 '21 at 09:38
  • Thank you! I will try this! I changed it's name to "my_list" – stats_noob Apr 10 '21 at 18:58
  • I just tried your code and got the following error: PDF error: May not be a PDF file (continuing anyway) PDF error (127): Illegal character <22> in hex string PDF error: Couldn't find trailer dictionary PDF error: Couldn't find trailer dictionary PDF error: Couldn't read xref table Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure. – stats_noob Apr 11 '21 at 02:47
  • do you know what I am doing wrong? thank you – stats_noob Apr 11 '21 at 02:48
  • I think I understand better now where you are coming from and overhauled my answer completely. Have a look and let me know if it helped you. – JBGruber Apr 11 '21 at 08:30

It seems that certain URLs in your data are not valid PDF files. You can wrap the call in tryCatch to handle the errors. If your dataframe is called df with a url column in it, you can do:

library(pdftools)

lapply(seq_along(df$url), function(x) {
  tryCatch({
    saveRDS(pdf_text(df$url[x]), file = sprintf('article_%d.rds', x))
  }, error = function(e) {})
})
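One caveat: the empty error handler returns NULL for every failed URL, which is why the saved list in the comments below is full of NULL entries. Returning the error message instead keeps a record of why each URL failed (a variant sketch, same df as above):

```r
library(pdftools)

results <- lapply(seq_along(df$url), function(x) {
  tryCatch({
    saveRDS(pdf_text(df$url[x]), file = sprintf('article_%d.rds', x))
    "OK"
  }, error = function(e) conditionMessage(e))  # record the failure reason
})
```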
Ronak Shah
  • I think there is a small typo in your code : file = sprintf('article_%d.rds', x)), ... I changed it to (removed the comma) : file = sprintf('article_%d.rds', x)) – stats_noob Apr 10 '21 at 15:55
  • Unfortunately, this code still did not work. I got errors like : PDF error: Couldn't find trailer dictionary PDF error: Couldn't find trailer dictionary PDF error: Couldn't read xref table , PDF error: May not be a PDF file (continuing anyway) , Warning messages: 1: In open.connection(con, "rb") : cannot open URL, In for (i in seq_along(specs)) { : closing unused connection 4 – stats_noob Apr 10 '21 at 16:01
  • The final file produced by your code looks like this : [[1]] NULL [[2]] NULL [[3]] NULL [[4]] NULL [[5]] NULL ... [[28]] NULL [[29]] NULL [[30]] NULL [[31]] NULL – stats_noob Apr 10 '21 at 16:03
  • You need to tell us the actual URL's, because it looks like the are *not* actually PDF files. – Dennis Apr 11 '21 at 17:59