
I'm trying to scrape links to all minutes and agendas provided on this website: https://www.charleston-sc.gov/AgendaCenter/

I've managed to scrape the section IDs associated with each category (and the years within each category) so that I can loop through the contents of each category-year (please see below). But I don't know how to scrape the hrefs that live inside those contents. In particular, because the links to each agenda live inside the drop-down menu under 'Download', it seems like I need to go through extra clicks to scrape the hrefs.

How do I scrape the minutes and agendas (inside the Download drop-down) for each table I select? Ideally, I would like a table with the date, the title of the agenda, links to the minutes, and links to the agenda.

I'm using RSelenium for this. Please see my code so far below, which lets me click through each category and year, but not much else. Please help!

rm(list = ls())
library(RSelenium)
library(tidyverse)
library(httr)
library(XML)
library(stringr)
library(RCurl)

# Pull the category names and year-tab IDs out of the raw page source
t  <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')
co <- str_match(t, 'aria-label="(.*?)"[ ]href="java')[,2]  # category names
yr <- str_match(t, 'id="(.*?)" aria-label')[,2]            # year-tab IDs

# Build one row per category-year, carrying the numeric section id down
df <- data.frame(cbind(co, yr)) %>%
  mutate_all(as.character) %>%
  filter_all(any_vars(!is.na(.))) %>%
  mutate(id = ifelse(grepl('^a0', yr), gsub('a0', '', yr), NA)) %>%
  tidyr::fill(c(co, id), .direction = 'down') %>%
  drop_na(co)

# Connect to a running Selenium server and open the Agenda Center page
remDr <- remoteDriver(port = 4445L, browserName = "chrome")
remDr$open()
remDr$navigate('https://www.charleston-sc.gov/AgendaCenter/')
remDr$screenshot(display = TRUE)

for (j in unique(df$id)){
  # expand the category section
  remDr$findElement(using = 'xpath',
                    value = paste0('//*[@id="cat', j, '"]/h2'))$clickElement()

  for (k in unique(df[which(df$id == j), 'yr'])){
    # switch to the year tab within this category
    remDr$findElement(using = 'xpath',
                      value = paste0('//*[@id="', k, '"]'))$clickElement()
    # NEED TO SCRAPE THE HREF ASSOCIATED WITH MINUTES AND AGENDA DOWNLOAD HERE #
  }
}
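
To illustrate what I'm after, I imagine something like the sketch below in place of the placeholder comment, assuming the download links are anchor elements whose href contains ViewFile (a guess on my part, not verified against the live page):

links <- remDr$findElements(using = 'css selector',
                            value = 'a[href*="ViewFile"]')  # selector is an assumption
hrefs <- vapply(links, function(e) unlist(e$getElementAttribute('href')),
                character(1))

But that would only give me bare hrefs; I still don't see how to pair them with each row's date and title.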
Les D
    What is `url[i,'url']`? It seems `i` is not defined in your code – Vasily A Oct 13 '20 at 03:48
  • It should be replaced with 'https://www.charleston-sc.gov/AgendaCenter/' - I'll update my code – Les D Oct 13 '20 at 13:36
  • Yes, make sure your code is fully working and reproducible in a clean environment: right now there are some typos (in the first line, an extra `)` before `, encoding`) and libraries used but not declared (`stringr`, `dplyr`, etc.) – Vasily A Oct 13 '20 at 18:21
  • The libraries have been declared and typos fixed. It runs on my end. Let me know if the code still runs into an issue. – Les D Oct 14 '20 at 19:22

1 Answer


Maybe you don't really need to click through all the elements? You can use the fact that all downloadable links have ViewFile in their href:

t  <- readLines('https://www.charleston-sc.gov/AgendaCenter/', encoding = 'UTF-8')

# Keep only the lines of the page source that contain a download link
viewfile <- str_extract_all(t, '.*ViewFile.*', simplify = T)
viewfile <- viewfile[viewfile != '']

library(data.table) # I use data.table because it's more convenient, but this can be done without it too
dt.viewfile <- data.table(origStr=viewfile)

# list the elements and patterns we will be looking for:
searchfor <- list(
  Title='name=[^ ]+ title=\"(.+)\" href',
  Date='<strong>(.+)</strong>',
  href='href=\"([^\"]+)\"',
  label= 'aria-label=\"([^\"]+)\"'
)

# For each pattern, create a column holding the captured group
for (this.i in names(searchfor)){
  this.full <- paste0('.*', searchfor[[this.i]], '.*')
  dt.viewfile[grepl(this.full, origStr), (this.i) := gsub(this.full, '\\1', origStr)]
}

# Clean records: each href appears on several source lines, so collapse them into one record per link
dt.viewfile[, `:=`(Title = na.omit(Title), Date = na.omit(Date), label = na.omit(label)),
            by = href]
dt.viewfile[, Date := gsub('<abbr title=".*">(.*)</abbr>', '\\1', Date)]  # strip HTML from dates
dt.viewfile <- unique(dt.viewfile[, .(Title, Date, href, label)]) # 690 records

What you get as the result is a table with the links to all downloadable files. You can now download them using any tool you like, for example download.file() or httr::GET():

dt.viewfile[, full.url := paste0('https://www.charleston-sc.gov', href)]
dt.viewfile[, filename := fs::path_sanitize(paste0(Title, ' - ', Date), replacement = '_')]

for (i in seq_len(nrow(dt.viewfile[1:10,]))){ # remove `1:10` limitation to process all records
  url      <- dt.viewfile[i, full.url]
  destfile <- dt.viewfile[i, filename]

  cat('\nDownloading', url, ' to ', destfile)

  fil <- GET(url, write_disk(destfile))

  # our destination file doesn't have an extension, so we take it from the server:
  serverFilename  <- gsub("inline;filename=(.*)", '\\1', headers(fil)$`content-disposition`)
  serverExtension <- tools::file_ext(serverFilename)

  # add the extension to the file we just saved
  file.rename(destfile, paste0(destfile, '.', serverExtension))
}
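
If you prefer base R over httr, a minimal equivalent using download.file() might look like the sketch below (assuming the same dt.viewfile table as above; it skips the extension fix-up, since download.file() doesn't expose the Content-Disposition header):

for (i in seq_len(nrow(dt.viewfile))){
  download.file(dt.viewfile[i, full.url],
                destfile = dt.viewfile[i, filename],
                mode = 'wb') # binary mode so PDFs aren't corrupted on Windows
}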

Now the only remaining problem is that the original webpage shows records for the last 3 years only. But instead of clicking View More through RSelenium, we can simply load the page for earlier dates, something like this:

t  <- readLines('https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=10/14/2014&endDate=10/14/2017', encoding = 'UTF-8')

then repeat the rest of the code as necessary.
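
For example, a rough sketch that stitches several date windows together before re-running the extraction above (the specific windows here are illustrative assumptions, not a verified partition of the archive):

base <- 'https://www.charleston-sc.gov/AgendaCenter/Search/?term=&CIDs=all&startDate=%s&endDate=%s'
windows <- list(c('10/14/2011', '10/14/2014'),
                c('10/14/2014', '10/14/2017'),
                c('10/14/2017', '10/14/2020'))

# read each window's search results and concatenate the raw lines
t <- unlist(lapply(windows, function(w)
  readLines(sprintf(base, w[1], w[2]), encoding = 'UTF-8')))
# then feed `t` into the ViewFile extraction shown above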

Vasily A