I am trying to load a CSV file from S3 into a Redshift table using Python. I used boto3 to pull the data from S3, used pandas to convert the data types (timestamp, string and integer), and tried to upload the dataframe to the table using to_sql (SQLAlchemy). It ended up with the error:

cursor.executemany(statement, parameters) psycopg2.errors.StringDataRightTruncation: value too long for type character varying(256)

Additional info: the string column contains a large amount of mixed data. I am also able to write the output as a CSV on my local machine without any issue.

My code is as follows:

import io
import boto3
import pandas as pd
from sqlalchemy import create_engine

# Read the CSV object from S3 into a dataframe
client = boto3.client('s3', aws_access_key_id="",
            aws_secret_access_key="")

response = client.get_object(Bucket='', Key='*.csv')
file = response['Body'].read()
df = pd.read_csv(io.BytesIO(file))

# Convert the column data types
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
df['text'] = df['text'].astype(str)
df['count'] = df['count'].fillna(0).astype(int)

# Write the selected columns to the Redshift table via SQLAlchemy
con = create_engine('postgresql://*.redshift.amazonaws.com:5439/dev')
select_list = ['date', 'text', 'count']
df = df[select_list]
df.to_sql('test', con, schema='parent', index=False, if_exists='replace')

I am a beginner; please help me understand what I am doing wrong. Ignore any typos. Thanks.
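
For reference, a quick way to confirm how long the values actually are is to compare the character length with the UTF-8 byte length (Redshift counts bytes, not characters). This is only a minimal check against the dataframe built above; the column name 'text' comes from the code:

# Minimal check against the dataframe built above:
# character length vs. UTF-8 byte length of the longest value.
max_chars = df['text'].str.len().max()
max_bytes = df['text'].map(lambda v: len(str(v).encode('utf-8'))).max()
print(f"longest value: {max_chars} characters, {max_bytes} bytes")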

  • Is the data you are loading longer than 256 characters? The error seems to tell you that the data is longer than the defined column size. Please note that Redshift stores string data in multi-byte UTF-8 format, which means that a non-ASCII character takes more than one byte of the column "size". If you have such characters, the defined length in Redshift will need to increase to fit these multi-byte characters – Bill Weiner Aug 24 '21 at 13:03
  • Yes, the data is too long. I assume the error is for the reason you have explained. Any suggestion on how to increase the Redshift-defined length? PS: I have tried astype **('|S')** instead of **(str)** and got the error _UnicodeEncodeError: 'ascii' codec can't encode character '\u2018' in position 354: ordinal not in range(128)_. I applied several solutions for this error and nothing worked. Thanks – Dinesh Aug 26 '21 at 01:50
  • If the issue is just the difference between the string length and the byte length (octet_length()) of UTF-8 characters, then the fix is to increase the varchar length. The question is by how much. Each ASCII character is 1 byte, most if not all European-language special characters are 2 bytes, Japanese and Chinese characters are 2 and 3 bytes (I think, but I'm not an expert), and 4-byte characters cover other East Asian scripts and emojis (again a rough understanding, not perfect). So if you have a few accented chars then upping the size by 1.5x should work, but if it is all Japanese then 3x. – Bill Weiner Aug 26 '21 at 21:05
  • Now if you need to convince the tools you are using to increase these sizes, then I'm not likely the best resource. – Bill Weiner Aug 26 '21 at 21:06
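
Following the comments above, one way to get a wider column is to tell to_sql which SQL type to create for the string column. By default the string column ends up as TEXT, which Redshift maps to varchar(256), which is likely where the limit comes from. This is only a sketch, not a tested answer: it assumes the 'text' column from the question's code is the offender and uses Redshift's maximum varchar length of 65535 (a smaller value sized to the data also works).

from sqlalchemy.types import VARCHAR

# Sketch: create the target column as a wider varchar instead of the default varchar(256).
# 65535 is Redshift's maximum varchar length; pick a smaller value if the data allows it.
df.to_sql(
    'test', con,
    schema='parent',
    index=False,
    if_exists='replace',
    dtype={'text': VARCHAR(65535)},  # 'text' is the assumed offending column
)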
