
I have to download multiple xlsx files containing a country's census data from the internet using R. The files are located at this link. The problems are:

  1. I am unable to write a loop that will go back and forth and download each file.
  2. The downloaded file gets some weird name, not the district name. How can I change it to the district name dynamically?

I have used the following code:

url <- "http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/HLPCA-28532-2011_H14_census.xlsx"
download.file(url, "HLPCA-28532-2011_H14_census.xlsx", mode = "wb")

But this downloads only one file at a time and doesn't change the file name.
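To clarify, what I'm after is something like the sketch below: loop over a vector of district codes and save each file under its district name. The codes and names here are placeholders (only the first code is real), and the download is switched off by a flag so the sketch runs without hitting the network:

```r
base <- "http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/"
codes     <- c("28532", "28533")          # district codes (placeholders except the first)
districts <- c("DistrictA", "DistrictB")  # readable names I want as file names

urls  <- paste0(base, "HLPCA-", codes, "-2011_H14_census.xlsx")
files <- paste0(districts, ".xlsx")       # save under the district name

do_download <- FALSE                      # set TRUE to actually fetch the files
for (i in seq_along(urls)) {
  # mode = "wb" keeps the binary .xlsx intact on Windows
  if (do_download) download.file(urls[i], destfile = files[i], mode = "wb")
}
```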

Thanks in advance.

user2504063
  • Do you have a list of file URLs or do you need to extract them from the link you provided? Please provide an example of "weird name": What *do* you get and what would you like to get? – CL. Aug 27 '15 at 08:25
  • I want to extract them from the link provided. As you can see, the other answer helped me out a bit, but it stopped after Haryana's "Karnal" file. I don't know what went wrong there. Try the other answer's code and you will find the error I am talking about. – user2504063 Aug 28 '15 at 08:08

1 Answer


Assuming you want all the data without knowing all of the URLs in advance, your question involves web scraping. The httr package provides useful functions for retrieving the HTML code of a given website, which you can then parse for links.

Maybe this bit of code is what you're looking for:

library(httr)

base_url <- "http://www.censusindia.gov.in/2011census/HLO/"  # main website

# fetch the index page and extract links to the per-region houselistings
r   <- GET(paste0(base_url, "HL_PCA/Houselisting-housing-HLPCA.html"))
rc  <- content(r, "text")
rcl <- unlist(strsplit(rc, "<a href =\\\""))               # split at links
rcl <- rcl[grepl("Houselisting-housing-.+?\\.html", rcl)]  # keep houselisting links

names <- gsub("^.+?>(.+?)</.+$", "\\1", rcl)               # link text = region names
names <- gsub("^\\s+|\\s+$", "", names)                    # trim whitespace
links <- gsub("^(Houselisting-housing-.+?\\.html).+$", "\\1", rcl)  # link targets

# iterate over regions
for (i in seq_along(links)) {
    url_hh <- paste0(base_url, "HL_PCA/", links[i])
    if (!url_success(url_hh)) next                         # skip dead links

    r   <- GET(url_hh)
    rc  <- content(r, "text")
    rcl <- unlist(strsplit(rc, "<a href =\\\""))           # split at links
    rcl <- rcl[grepl("\\.xlsx", rcl)]                      # keep .xlsx links

    hh_names <- gsub("^.+?>(.+?)</.+$", "\\1", rcl)        # link text = district names
    hh_names <- gsub("^\\s+|\\s+$", "", hh_names)          # trim whitespace
    hh_links <- gsub("^(.+?\\.xlsx).+$", "\\1", rcl)       # link targets

    # iterate over districts, saving each file under a readable name
    for (j in seq_along(hh_links)) {
        url_xlsx <- paste0(base_url, "HL_PCA/", hh_links[j])
        if (!url_success(url_xlsx)) next

        filename <- paste0(names[i], "_", hh_names[j], ".xlsx")
        download.file(url_xlsx, filename, mode = "wb")
    }
}
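Note: `url_success()` has since been removed from httr. If the code above errors on that call, `http_error()` is its replacement in current versions; it accepts a URL, a response object, or a bare status code, and returns TRUE for 4xx/5xx responses, so the old check inverts. A small shim:

```r
library(httr)

# url_success() is defunct in current httr; http_error() replaces it
# with the opposite meaning (TRUE means the request failed).
url_ok <- function(x) !http_error(x)
```

With the shim, `if (!url_success(url_hh)) next` becomes `if (!url_ok(url_hh)) next`.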
MarkusN
  • The code you shared started working, but then I encountered an error: 'Error in download.file(url_xlsx, filename, mode = "wb") : cannot open destfile 'Haryana_HH_PCA1/HLPCA-06072-2011_H14_census.xlsx" style="text-decoration:none;color:black;">Kurukshetra .xlsx', reason 'Invalid argument''. But there is no difference in the code. Why is this not working? – user2504063 Aug 28 '15 at 08:01
  • Not all of the HTML files are exactly the same. I improved the code slightly; give it a try. – MarkusN Aug 28 '15 at 12:16
  • As you said, not all files are exactly the same. I ran into the same error again with "Daman & Diu". How do you differentiate between the different HTML files? Otherwise, this code downloaded 8 rows, thanks for that. But it still requires some tweaking. – user2504063 Aug 31 '15 at 05:16
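The failures reported in the comments come from the regex-based link extraction, which breaks when an `<a>` tag carries extra attributes (note the `style=...` fragment inside the error's file name). Parsing with a real HTML parser such as xml2 is less fragile. A sketch against a sample snippet (the markup below imitates the problem pages and is not verified against the live site):

```r
library(xml2)

# Sample HTML imitating the census pages, including the extra style
# attribute that broke the regex approach in the answer above.
html <- '<html><body>
<a href ="HLPCA-06072-2011_H14_census.xlsx" style="text-decoration:none;color:black;"> Kurukshetra </a>
<a href ="Houselisting-housing-Haryana.html">Haryana</a>
</body></html>'

page  <- read_html(html)
nodes <- xml_find_all(page, "//a[contains(@href, '.xlsx')]")  # only .xlsx links
links <- xml_attr(nodes, "href")   # the file URL, free of trailing markup
dist  <- trimws(xml_text(nodes))   # the link text, i.e. the district name
```

The XPath query selects only anchors whose `href` contains `.xlsx`, so extra attributes or whitespace in the markup no longer corrupt the extracted file names.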