
I want to download a number of .txt files. I have a data frame "New_test" in which the URLs are under 'url' and the destination file names under 'code'.

"New_test.txt"

"url"   "code"
"1" "http://documents.worldbank.org/curated/en/704931468739539459/text/multi-page.txt" "704931468739539459.txt"
"2" "http://documents.worldbank.org/curated/en/239491468743788559/text/multi-page.txt"  "239491468743788559.txt"
"3" "http://documents.worldbank.org/curated/en/489381468771867920/text/multi-page.txt"  "489381468771867920.txt"
"4" "http://documents.worldbank.org/curated/en/663271468778456388/text/multi-page.txt"  "663271468778456388.txt"
"5" "http://documents.worldbank.org/curated/en/330661468742793711/text/multi-page.txt"  "330661468742793711.txt"
"6" "http://documents.worldbank.org/curated/en/120441468766519490/text/multi-page.txt"  "120441468766519490.txt"
"7" "http://documents.worldbank.org/curated/en/901481468770727038/text/multi-page.txt"  "901481468770727038.txt"
"8" "http://documents.worldbank.org/curated/en/172351468740162422/text/multi-page.txt"  "172351468740162422.txt"
"9" "http://documents.worldbank.org/curated/en/980401468740176249/text/multi-page.txt"  "980401468740176249.txt"
"10" "http://documents.worldbank.org/curated/en/166921468759906515/text/multi-page.txt" "166921468759906515.txt"
"11" "http://documents.worldbank.org/curated/en/681071468781809792/text/DRD169.txt" "681071468781809792.txt"
"12" "http://documents.worldbank.org/curated/en/358291468739333041/text/multi-page.txt" "358291468739333041.txt"
"13" "http://documents.worldbank.org/curated/en/716041468759870921/text/multi0page.txt" "716041468759870921.txt"
"14" "http://documents.worldbank.org/curated/en/961101468763752879/text/34896.txt"  "961101468763752879.txt"`

This is the script:

rm(list=ls())

require(quanteda)
library(stringr)

workingdir <- setwd("~/Study/Master/Thesis/Mining/R/WorldBankDownl")
test <- read.csv(paste0(workingdir,"/New_test.txt"), header = TRUE, 
stringsAsFactors = FALSE, sep="\t")

#Loop through every url in test_df and download in target directory with name = code
for (url in test) {
  print(head(url))
  print(head(test$code))
  destfile <- paste0('~/Study/Master/Thesis/Mining/R/WorldBankDownl/Sources/', test$code)
  download.file(test$url, destfile, method = "wget", quiet = TRUE)
}

And this is the error I get:

Error in download.file(test$url, destfile, method = "wget", quiet = TRUE) : 
'url' must be a length-one character vector
  • try changing: "for (url in test)" to: "for url in test$code" and also "download.file(test$url" to "download.file(url" – Chris Mar 20 '18 at 15:14
  • Thank you for your reply. "for (url in test$code)" (without the parentheses the loop doesn't work) gives the error: Error in download.file(url, destfile, method = "wget", quiet = TRUE) : 'destfile' must be a length-one character vector – Mel Schickel Mar 20 '18 at 15:59
  • you also need to change the second part I mentioned (i.e., change download.file(test$url to download.file(url) – Chris Mar 20 '18 at 16:06
  • download.file(url, destfile, method = "wget", quiet = TRUE) – Mel Schickel Mar 20 '18 at 16:12
  • Sorry I was interrupted, yes I had changed the second part as well – Mel Schickel Mar 20 '18 at 16:21
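
Written out, the fix these comments converge on is a per-file loop. A sketch, assuming the test data frame from the question; with method = "wget" each argument must be length-one:

# Loop over rows so each download.file() call gets one url and one destfile
for (i in seq_len(nrow(test))) {
  destfile <- paste0('~/Study/Master/Thesis/Mining/R/WorldBankDownl/Sources/',
                     test$code[i])
  download.file(test$url[i], destfile, method = "wget", quiet = TRUE)
}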

2 Answers


Here's a simpler way to do it. You will need to substitute your test$url for txturls (both are character vectors containing the URLs of the text files).

txturls <- c("http://documents.worldbank.org/curated/en/704931468739539459/text/multi-page.txt", 
             "http://documents.worldbank.org/curated/en/239491468743788559/text/multi-page.txt",
             "http://documents.worldbank.org/curated/en/489381468771867920/text/multi-page.txt")

library("quanteda")

txt <- character()
for (i in txturls) {
    # read the file from the URL
    temp <- readLines(url(i))
    # concatenate lines into one text
    temp <- texts(temp, groups = 1)
    # remove form feed character
    temp <- gsub("\\f", "", temp)
    # concatenate into the vector
    txt <- c(txt, temp)
}

# form the quanteda corpus
urlcorp <- corpus(txt, docvars = data.frame(source = txturls, stringsAsFactors = FALSE))
summary(urlcorp)
# Corpus consisting of 3 documents:
# 
#  Text Types Tokens Sentences                                                                           source
#     1  1343   5125       135 http://documents.worldbank.org/curated/en/704931468739539459/text/multi-page.txt
#   1.1  1343   5125       135 http://documents.worldbank.org/curated/en/239491468743788559/text/multi-page.txt
#   1.2  1343   5125       135 http://documents.worldbank.org/curated/en/489381468771867920/text/multi-page.txt
# 
# Source: /Users/kbenoit/Dropbox (Personal)/GitHub/dictionaries_paper/* on x86_64 by kbenoit
# Created: Tue Mar 20 15:51:05 2018
# Notes: 
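
For the question's data frame, that substitution is just the url column (a plain character vector, not c(test)):

txturls <- test$url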
  • Thank you. I assigned test to txturls. test <- read.csv("~/Study/Master/Thesis/Mining/R/WorldBankDownl/total_txt.txt", header = TRUE, stringsAsFactors = FALSE, sep="\t") txturls <- c(test) Then I got the error Error: nrow(docvars) == length(x) is not TRUE – Mel Schickel Mar 20 '18 at 16:56
  • I meant you want to use `test$url` where I have used `txturls`. The object over whose elements you are looping in the for loop needs to be a character vector containing the URLs of the files to be read. – Ken Benoit Mar 20 '18 at 17:25
  • I tried this, and it didn't work. I now assume that it was due to overcalling the source. In the meantime, I found a solution to my problem, but your solution looks neater. So I'll try to implement this. – Mel Schickel Mar 21 '18 at 14:27

Everyone, thank you for helping me. For me the solution was changing the download method. 'wget' demands a length-one character vector for both 'url' and 'destfile', but here both are length-fourteen character vectors. Now I use the method 'libcurl', which accepts vectors as long as 'url' and 'destfile' have the same length. If you use this method, make sure that quiet = TRUE.
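
A minimal sketch of the difference, assuming test is the data frame from the question:

destfile <- paste0('Sources/', test$code)
# fails: 'url' must be a length-one character vector
# download.file(test$url, destfile, method = "wget", quiet = TRUE)
# works: 'libcurl' accepts equal-length vectors of urls and destfiles
download.file(test$url, destfile, method = "libcurl", quiet = TRUE)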

Furthermore, it is possible that you have a working loop but still get this error:

Error in download.file(test$url, destfile, method = "libcurl", quiet = TRUE) : 
cannot download any files
In addition: There were 50 or more warnings (use warnings() to see the first 50) 

This means the server can't keep up with your requests (you're basically DDoS'ing it), so the downloads have to be slowed down.

rm(list=ls())

require(quanteda)
library(stringr)

workingdir <- "~/Study/Master/Thesis/Mining/R/WorldBankDownl"
setwd(workingdir)
test <- read.csv(paste0(workingdir, "/New_test.txt"), header = TRUE,
                 stringsAsFactors = FALSE, sep = "\t")

# Download every url in test to the target directory with name = code.
# 'libcurl' accepts equal-length vectors, so no loop is needed. If you
# get the error above and no files are downloaded, slow your requests
# down (see the sketch below).
destfile <- paste0(workingdir, '/Sources/WB_', test$code)
download.file(test$url, destfile, method = "libcurl", quiet = TRUE)
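
If you do hit that error, one way to slow things down is a per-file loop with a pause between requests. This is a sketch rather than part of the original answer; the half-second delay is an arbitrary choice:

# Throttled alternative: one request at a time with a pause in between
for (i in seq_len(nrow(test))) {
  download.file(test$url[i], destfile[i], method = "libcurl", quiet = TRUE)
  Sys.sleep(0.5)  # pause so the server isn't flooded with requests
}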