
I'm using the following code in R to download data from Google Trends, which I took mostly from here: http://christophriedl.net/2013/08/22/google-trends-with-r/

############################################
##    Query GoogleTrends from R
##
## by Christoph Riedl, Northeastern University
## Additional help and bug-fixing re cookies by
## Philippe Massicotte Université du Québec à Trois-Rivières (UQTR)
############################################

# Load required libraries
library(RCurl)      # For getURL() and curl handler / cookie / google login
library(stringr)    # For str_trim() to trim whitespace from strings
# Google account settings
username <- "USERNAME"
password <- "PASSWORD"

# URLs
loginURL        <- "https://accounts.google.com/accounts/ServiceLogin"
authenticateURL <- "https://accounts.google.com/accounts/ServiceLoginAuth"
trendsURL       <- "http://www.google.com/trends/trendsReport?"

############################################
## This gets the GALX cookie which we need to pass back with the login form
############################################
getGALX <- function(curl) {
  txt <- basicTextGatherer()
  curlPerform( url=loginURL, curl=curl, writefunction=txt$update, header=TRUE, ssl.verifypeer=FALSE )

  tmp <- txt$value()

  val <- grep("Cookie: GALX", strsplit(tmp, "\n")[[1]], val = TRUE)
  return(strsplit(val, "[:=;]")[[1]][3])
}


############################################
## Function to perform Google login and get cookies ready
############################################
gLogin <- function(username, password) {
  ch <- getCurlHandle()

  ans <- (curlSetOpt(curl = ch,
                     ssl.verifypeer = FALSE,
                     useragent = getOption('HTTPUserAgent', "R"),
                     timeout = 60,         
                     followlocation = TRUE,
                     cookiejar = "./cookies",
                     cookiefile = ""))

  galx <- getGALX(ch)
  authenticatePage <- postForm(authenticateURL, .params=list(Email=username, Passwd=password, GALX=galx, PersistentCookie="yes", continue="http://www.google.com/trends"), curl=ch)

  authenticatePage2 <- getURL("http://www.google.com", curl=ch)

  if(getCurlInfo(ch)$response.code == 200) {
    print("Google login successful!")
  } else {
    print("Google login failed!")
  }
  return(ch)
}

##

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

get_interest_over_time <- function(res, clean.col.names = TRUE) {
  # remove all text before "Interest over time" data block begins
  data <- gsub(".*Interest over time", "", res)

  # remove all text after "Interest over time" data block ends
  data <- gsub("\n\n.*", "", data)

  # convert "interest over time" data block into data.frame
  data.df <- read.table(text = data, sep =",", header=TRUE)

  # Reduce each date range to just the end-of-week date
  data.df$Week <- gsub(".*\\s-\\s", "", data.df$Week)
  data.df$Week <- as.Date(data.df$Week)

  # clean column names
  if(clean.col.names == TRUE) colnames(data.df) <- gsub("\\.\\..*", "", colnames(data.df))


  # return "interest over time" data.frame
  return(data.df)
}

############################################
## Read data for a query
############################################
ch <- gLogin( username, password )
authenticatePage2 <- getURL("http://www.google.com", curl=ch)

res <- getForm(trendsURL, q="sugar", geo="US", content=1, export=1, graph="all_csv", curl=ch)
# Check if quota limit reached
if( grepl( "You have reached your quota limit", res ) ) {
  stop( "Quota limit reached; You should wait a while and try again lateer" )
}
df <- get_interest_over_time(res)
head(df)

write.csv(df,"sugar.csv")
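
For context, the parsing in `get_interest_over_time()` can be exercised on a synthetic response string. The fragment below is made up for illustration (the real export contains several blocks, and the exact header wording may differ), but it shows what each `gsub()` step does:

```r
# Synthetic export text, made up for illustration: the real CSV contains
# several blocks ("Top regions", "Top searches", ...); only the
# "Interest over time" block is kept by the parsing below.
res_demo <- paste(
  "Web Search interest: sugar",
  "United States; 2004 - present",
  "",
  "Interest over time",
  "Week,sugar: (US)",
  "2015-06-07 - 2015-06-13,64",
  "2015-06-14 - 2015-06-20,66",
  "",
  "Top regions",
  "Region,sugar",
  sep = "\n")

# Same steps as get_interest_over_time():
data <- gsub(".*Interest over time", "", res_demo)  # drop text before the block
data <- gsub("\n\n.*", "", data)                    # drop text after the block
df   <- read.table(text = data, sep = ",", header = TRUE)
df$Week <- as.Date(gsub(".*\\s-\\s", "", df$Week))  # keep end-of-week date
colnames(df) <- gsub("\\.\\..*", "", colnames(df))  # strip mangled name suffix
df
```

This yields a two-row data.frame with columns `Week` and `sugar`, which is the shape the rest of the script expects.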

When I search just for the US, or any single country, everything works fine, but I need more disaggregated data, at the Metropolitan Area level. However, I cannot get those queries to work with this script. Whenever I try, by typing, for example, "US-IL" in the geo field, I get an error:

Error in read.table(text = data, sep = ",", header = TRUE) : 
more columns than column names 
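
I can reproduce this class of error on a toy string (made up for illustration): `read.table()` raises it when some data row has more comma-separated fields than the header row, which `count.fields()` makes easy to spot:

```r
# Toy block, made up: the third line has a stray extra field, which is the
# situation read.table() reports as "more columns than column names".
bad <- paste("Week,sugar",
             "2015-06-07 - 2015-06-13,64",
             "2015-06-14 - 2015-06-20,66,extra",
             sep = "\n")

n <- count.fields(textConnection(bad), sep = ",")
n                 # 2 2 3 -> the third line has one field too many
which(n != n[1])  # index of the offending line(s)
```

Running the same check on the raw `res` for a metro query would point at which lines break the header/row assumption (extra sub-blocks or commas inside city names seem like plausible culprits, though I haven't confirmed the cause).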

The same happens if I try to take a trend for a Metropolitan Area (using something like "US-IL-602" for Chicago, for example). Does anyone know how I could modify this script to make it work?

Thank you very much,

Brian.

  • Is this an R question or a google API question? Does it work in Google's query tester widget? – C8H10N4O2 Jul 03 '15 at 02:28
  • Well, it seems to be more of an API question, since the code itself runs fine as it is, but once I introduce a little change, it crashes. I couldn't find the correct way to introduce more detailed geographic references in the queries, and what I've tried hasn't worked so far. – Brian Feld Jul 03 '15 at 14:38
  • This is only tangentially related, but do you know what the Google metropolitan areas are? I'm trying to match them to other data sources, and I'm not exactly sure. Their geographies are delineated on a map, so there should be shapefiles that would make matching possible. – robin.datadrivers Aug 12 '16 at 17:53
  • @robin.datadrivers I think they're Nielsen Designated Media Areas. – ASGM Dec 08 '20 at 16:59
