I'm coding a data extraction tool to load data from Google Search Console (GSC from now on) and store it on an Amazon Redshift (AR from now on) database. I coded a function to parse the elements on the data frame coming from GSC to determine the field structure when creating tables on AR.
This is the R function I created:
get_table_fields <- function (d) {
r <- FALSE
if (is.data.frame(d)) {
r <- vector()
t <- d[1,]
c <- colnames(t)
for (k in c) {
v <- t[, k]
if (is.character(v)) {
r[k] <- "nvarchar(max)"
} else if (!is.na(as.Date(as.character(v), format = c("%Y-%m-%d")))) {
r[k] <- "date"
} else if (is.numeric(v)) {
r[k] <- ifelse(grepl(".", v, fixed = TRUE), "real", "integer")
}
}
}
return(r)
}
So far, so good. I pass the full data frame and the function extracts all relevant information from the first row, giving me the structure needed to create a table on AR.
This is the code I use to extract data from GSC and write it onto AR:
# retrieve the table fields schema
s_fields <- get_table_fields(data)
# compose the table creation definition out of the fields schema
d_fields <- paste(toString(sapply(names(s_fields), function (x) {
return(sprintf('"%s" %s', x, s_fields[x]))
})))
# compose the table creation query
c_query <- sprintf("CREATE TABLE IF NOT EXISTS %s (%s);", t_table_name, d_fields)
if (nrow(data) > 0) {
# create the table if it doesn't exist
dbSendUpdate(db, c_query)
# delete previous saved records for the specified date
dbSendUpdate(db, sprintf("DELETE FROM %s WHERE date = '%s' AND gsc_domain = '%s';", t_table_name, date_range[d], config.gsc.domain))
# upload the Google Search Console (GSC) data to Amazon Redshift (AR)
dbWriteTable(db, t_table_name, data, append = TRUE, row.names = FALSE)
}
db
is the database connection object, declated like this:
# initialize the Amazon Redshift JDBC driver
driver <- JDBC("com.amazon.redshift.jdbc42.Driver", "drivers/RedshiftJDBC42-1.2.16.1027.jar", identifier.quote = "`")
# connect to the Amazon Redshift database instance
db <- dbConnect(driver, sprintf("jdbc:redshift://%s:%s/%s?user=%s&password=%s", config.ar.host, config.ar.port, config.ar.database, config.ar.user, config.ar.password))
t_table_name
is a concatenated string with the different dimensions in the GSC extraction definition with gsc_by
as a prefix and joined with an underscore so, if we wanted to extract date, page and device, the table name would be gsc_by_date_page_device
So, basically, what this code does is gather a data frame from GSC, ensure the table for the specified extraction exists. If not, it creates it. Otherwise, it removes any existing data (in case the extraction is re-launched not to duplicate any entries) and stores it in AR.
The problem is it seems like either the AR database or the JDBC driver from Amazon Redshift is forcing my column definitions as varchar(255) instead of the nvarchar(max) or varchar(max) I'm trying to write. I've tried different combinations but the result is always the same:
<simpleError in .local(conn, statement, ...): execute JDBC update query failed in dbSendUpdate ([Amazon](500310) Invalid operation: Value too long for character type
Details:
-----------------------------------------------
error: Value too long for character type
code: 8001
context: Value too long for type character varying(255)
query: 116225
location: funcs_string.hpp:395
process: padbmaster [pid=29705]
-----------------------------------------------;)>
If I print the c_query
variable (the table creation query) before sending the query, it prints out correctly:
CREATE TABLE IF NOT EXISTS gsc_by_date_query_device ("date" date, "query" nvarchar(max), "device" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" integer, "gsc_domain" nvarchar(max));
CREATE TABLE IF NOT EXISTS gsc_by_date_query_country_device ("date" date, "query" nvarchar(max), "country" nvarchar(max), "device" nvarchar(max), "countryName" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" integer, "gsc_domain" nvarchar(max));
CREATE TABLE IF NOT EXISTS gsc_by_date_page_device ("date" date, "page" nvarchar(max), "device" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" real, "gsc_domain" nvarchar(max));
If I execute this on SQLWorkbench/J (the tool I'm using for checking), it creates the table correctly and even with that, what is failing is the data insertion.
Can you give me a hint on what am I doing wrong or how can I specify the text columns as bigger than 256 characters? I'm having a nightmare with this and I think I've tried everything I could.