
I'm coding a data extraction tool to load data from Google Search Console (GSC from now on) and store it in an Amazon Redshift (AR from now on) database. I coded a function that parses the elements of the data frame coming from GSC to determine the field structure when creating tables on AR.

This is the R function I created:

get_table_fields <- function (d) {
  r <- FALSE

  if (is.data.frame(d)) {
    r <- vector()
    t <- d[1, ]          # only the first row is needed to infer the types
    c <- colnames(t)

    for (k in c) {
      v <- t[, k]

      if (is.character(v)) {
        # character columns become unbounded text fields
        r[k] <- "nvarchar(max)"
      } else if (!is.na(as.Date(as.character(v), format = "%Y-%m-%d"))) {
        # non-character values whose text form parses as YYYY-MM-DD (e.g. Date objects)
        r[k] <- "date"
      } else if (is.numeric(v)) {
        # numbers printed with a decimal point become real, the rest integer
        r[k] <- ifelse(grepl(".", v, fixed = TRUE), "real", "integer")
      }
    }
  }

  return(r)
}

So far, so good. I pass the full data frame and the function extracts all relevant information from the first row, giving me the structure needed to create a table on AR.
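
For illustration, calling the function on a small, made-up data frame (the column names below are just an example, not the real GSC output) returns a named vector of AR column types, roughly like this:

d <- data.frame(
  date   = as.Date("2018-05-01"),
  query  = "some search query",
  clicks = 12L,
  ctr    = 0.34,
  stringsAsFactors = FALSE
)

get_table_fields(d)
#           date            query           clicks              ctr
#         "date"  "nvarchar(max)"        "integer"           "real"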

This is the code I use to extract data from GSC and write it onto AR:

# retrieve the table fields schema
s_fields <- get_table_fields(data)

# compose the table creation definition out of the fields schema
d_fields <- paste(toString(sapply(names(s_fields), function (x) {
  return(sprintf('"%s" %s', x, s_fields[x]))
})))

# compose the table creation query
c_query <- sprintf("CREATE TABLE IF NOT EXISTS %s (%s);", t_table_name, d_fields)

if (nrow(data) > 0) {
  # create the table if it doesn't exist
  dbSendUpdate(db, c_query)

  # delete previous saved records for the specified date
  dbSendUpdate(db, sprintf("DELETE FROM %s WHERE date = '%s' AND gsc_domain = '%s';", t_table_name, date_range[d], config.gsc.domain))

  # upload the Google Search Console (GSC) data to Amazon Redshift (AR)
  dbWriteTable(db, t_table_name, data, append = TRUE, row.names = FALSE)
}

db is the database connection object, declared like this:

# initialize the Amazon Redshift JDBC driver
driver <- JDBC("com.amazon.redshift.jdbc42.Driver", "drivers/RedshiftJDBC42-1.2.16.1027.jar", identifier.quote = "`")

# connect to the Amazon Redshift database instance
db <- dbConnect(driver, sprintf("jdbc:redshift://%s:%s/%s?user=%s&password=%s", config.ar.host, config.ar.port, config.ar.database, config.ar.user, config.ar.password))

t_table_name is a string built by concatenating the different dimensions in the GSC extraction definition, prefixed with gsc_by and joined with underscores, so if we wanted to extract date, page and device, the table name would be gsc_by_date_page_device.
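
For example (the dimensions vector is not shown in this question, so its name here is purely illustrative):

dimensions <- c("date", "page", "device")
t_table_name <- paste0("gsc_by_", paste(dimensions, collapse = "_"))
# t_table_name is now "gsc_by_date_page_device"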

So, basically, what this code does is gather a data frame from GSC and make sure the table for the specified extraction exists, creating it if it doesn't. It then removes any previously saved data for that date (so that re-launching the extraction doesn't duplicate entries) and stores the new data in AR.

The problem is that either the AR database or the Amazon Redshift JDBC driver seems to be forcing my column definitions to varchar(255) instead of the nvarchar(max) or varchar(max) I'm trying to write. I've tried different combinations, but the result is always the same:

<simpleError in .local(conn, statement, ...): execute JDBC update query failed in dbSendUpdate ([Amazon](500310) Invalid operation: Value too long for character type
Details:
-----------------------------------------------
error:  Value too long for character type
code:      8001
context:   Value too long for type character varying(255)
query:     116225
location:  funcs_string.hpp:395
process:   padbmaster [pid=29705]
-----------------------------------------------;)>

If I print the c_query variable (the table creation query) before sending the query, it prints out correctly:

CREATE TABLE IF NOT EXISTS gsc_by_date_query_device ("date" date, "query" nvarchar(max), "device" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" integer, "gsc_domain" nvarchar(max));
CREATE TABLE IF NOT EXISTS gsc_by_date_query_country_device ("date" date, "query" nvarchar(max), "country" nvarchar(max), "device" nvarchar(max), "countryName" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" integer, "gsc_domain" nvarchar(max));
CREATE TABLE IF NOT EXISTS gsc_by_date_page_device ("date" date, "page" nvarchar(max), "device" nvarchar(max), "clicks" integer, "impressions" integer, "ctr" real, "position" real, "gsc_domain" nvarchar(max));

If I execute this in SQL Workbench/J (the tool I'm using for checking), it creates the table correctly, and even so, what is failing is the data insertion.

Can you give me a hint on what I'm doing wrong, or on how I can specify the text columns as larger than 255 characters? I'm having a nightmare with this and I think I've tried everything I could.

1 Answer


I've written an extensive blog post explaining a lot of the nuances of reading/writing data to/from Amazon Redshift: https://auth0.com/blog/a-comprehensive-guide-for-connecting-with-r-to-redshift/

In particular, the best way to read data with R is using the RPostgres library, and to write data I recommend using the R package I created: https://github.com/sicarul/redshiftTools
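
For reference, a minimal RPostgres connection sketch, reusing the config variables from your question (adjust sslmode to whatever your cluster requires), would look something like this:

library(DBI)
library(RPostgres)

con <- dbConnect(
  Postgres(),
  host     = config.ar.host,
  port     = config.ar.port,        # Redshift listens on 5439 by default
  dbname   = config.ar.database,
  user     = config.ar.user,
  password = config.ar.password,
  sslmode  = "require"
)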

Specifically, redshiftTools does not have the issue you are reporting: varchars are created based on the length of the strings, using the function calculateCharSize: https://github.com/sicarul/redshiftTools/blob/master/R/table_definition.R#L2

Though, as a best practice, I'd say that unless it's a temporary or staging table, you should always try to create the table yourself; that way you can control sortkeys, distkeys and compression, which are very important for performance in Amazon Redshift.
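
For example, a hand-written definition for one of the tables in your question could look roughly like this (the DISTKEY/SORTKEY choices are purely illustrative and should follow how you actually query the table; column compression can be added per column with ENCODE):

dbSendUpdate(db, "
  CREATE TABLE IF NOT EXISTS gsc_by_date_page_device (
    \"date\"        date,
    \"page\"        varchar(max),
    \"device\"      varchar(max),
    \"clicks\"      integer,
    \"impressions\" integer,
    \"ctr\"         real,
    \"position\"    real,
    \"gsc_domain\"  varchar(max)
  )
  DISTKEY (\"date\")
  SORTKEY (\"date\", \"page\");
")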

If you have already created the table, you can do something like:

rs_replace_table(data, dbcon=db, table_name=t_table_name, bucket="mybucket", split_files=4)

If you haven't created the table yet, you can do practically the same thing with rs_create_table.
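
Presumably the call mirrors the rs_replace_table example above (double-check the exact argument names in the package documentation):

rs_create_table(data, dbcon=db, table_name=t_table_name, bucket="mybucket", split_files=4)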

You'll need an S3 bucket and the AWS keys to access it, since this package uploads the data to S3 and then points Redshift at that bucket; it's the fastest way to bulk upload data.
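
For the credentials, something along these lines should work; check the package README for the exact environment variable names it reads, since the ones below are the standard AWS ones and are an assumption here:

Sys.setenv(
  AWS_ACCESS_KEY_ID     = "your-access-key",
  AWS_SECRET_ACCESS_KEY = "your-secret-key",
  AWS_DEFAULT_REGION    = "us-east-1"
)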

  • How is this a solution to the question!? – Achilleus Feb 16 '22 at 19:49
  • @Achilleus i wrote this more than 3 years ago, trying to help with uploading tables into Amazon Redshift (I'm now using Snowflake), why do you think my answer was not helpful? – Sicarul Feb 17 '22 at 21:21