5

I am reading data from Redshift using Pandas. I have one bigint (int8) column whose values are coming through in exponential (float) notation. I tried the following approaches, but I get data truncation in each case.

A sample value in that column is 635284328055690862. It is being read as 6.352843e+17.

I tried converting that column to int64 in Python:

import numpy as np
df["column_name"] = df["column_name"].astype(np.int64)

The output in this case is 635284328055690880. Here I am losing data: the last digits are rounded away, leaving a 0 at the end.

Expected Output: 635284328055690862

I am getting the same result even if I do this:

pd.set_option('display.float_format', lambda x: '%.0f' % x)

Output: 635284328055690880

Expected Output: 635284328055690862

It seems like this is normal Pandas behavior. I even tried creating a DataFrame from a list and still get the same result:

import pandas as pd
import numpy as np

pd.set_option('display.float_format', lambda x: '%.0f' % x)
sample_data = [[635284328055690862, 758364950923147626], [np.NaN, np.NaN], [1, 3]]
df = pd.DataFrame(sample_data)


Output:
0 635284328055690880 758364950923147648
1                nan                nan
2                  1                  3

What I have noticed is that whenever there is a nan in the DataFrame, this issue occurs.
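A quick comparison confirms this (the column name col is just made up for the example):

import numpy as np
import pandas as pd

# Without nan the column stays int64 and the value is exact.
no_nan = pd.DataFrame({"col": [635284328055690862, 1]})
print(no_nan["col"].dtype)       # int64
print(no_nan["col"][0])          # 635284328055690862

# With nan the column is upcast to float64 and the value gets rounded.
with_nan = pd.DataFrame({"col": [635284328055690862, np.nan]})
print(with_nan["col"].dtype)     # float64
print(int(with_nan["col"][0]))   # 635284328055690880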

I am using the code below to fetch data from Redshift:

from sqlalchemy import create_engine 
import pandas as pd  
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>' 
engine = create_engine(connstr) 
with engine.connect() as conn, conn.begin():     
    df = pd.read_sql('''select * from schema.table_name''', conn)
print(df)

Please help me fix this. Thanks in advance.

Xyz
  • If I understand the input correctly, I don't find any problem by doing the following : `data = pd.Series([635284328055690862]).astype(int)`. It outputs `635284328055690862` – Grayrigel Sep 16 '20 at 13:01
  • that is not a single value, it is one of the column's values. @Grayrigel – Xyz Sep 16 '20 at 13:09
  • You can refer to the updated example; it's exactly the same issue. – Xyz Sep 16 '20 at 13:10
  • I get the same output even when converting the column to a Series: `data = pd.Series(df['column_name'].values).astype(np.int64)`. Here also the output is the same as mentioned in the question. @Grayrigel – Xyz Sep 16 '20 at 13:17
  • Yeah. I see now. I am not sure why this is the case. – Grayrigel Sep 16 '20 at 13:21
  • How are you reading the data? I tried saving the sample data in the `csv` format and then, read using `pd.read_csv`, which works fine. Can you check if it works for you? – Grayrigel Sep 16 '20 at 13:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221576/discussion-between-grayrigel-and-lonewalker). – Grayrigel Sep 16 '20 at 13:42
  • I tried both creating a DataFrame from a list and reading data from a Redshift table. Just to let you know, we have this issue whenever there is a `NULL`/`nan` value in the data. @Grayrigel – Xyz Sep 16 '20 at 13:43

2 Answers

3

This happens because standard integer datatypes do not provide a way to represent missing data. Since floating point datatypes do provide nan, the old way of handling this was to convert numerical columns with missing data to float.
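The float detour is what destroys the value: a 64-bit float has only 53 bits of mantissa, so integers larger than 2**53 cannot all be represented exactly, and the sample value gets rounded to the nearest representable float:

>>> x = 635284328055690862
>>> x > 2**53
True
>>> int(float(x))  # a round trip through float64 lands on the nearest representable value
635284328055690880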

To correct this, pandas has introduced a Nullable integer data type (Int64). If you were doing something as simple as reading a csv, you could explicitly specify this type in your call to read_csv like so:

>>> pandas.read_csv('sample.csv', dtype="Int64")
             column_a  column_b
0  635284328055690880     45564
1                <NA>        45
2                   1      <NA>
3                   1         5

However, the problem persists! It seems that even though 635284328055690862 can be represented as a 64-bit integer, at some point, pandas still passes the value through a floating-point conversion step, changing the value. This is pretty odd, and might even be worth raising as an issue with the pandas developers.
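Notably, the Int64 extension type itself can hold the value exactly when it is built directly from Python integers, which suggests the rounding happens somewhere in read_csv's parsing path rather than in the dtype:

>>> arr = pandas.array([635284328055690862, None], dtype="Int64")
>>> arr[0]
635284328055690862
>>> arr[1]
<NA>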

The best workaround I see in this scenario is to use the "object" datatype, like so:

>>> pandas.read_csv('sample.csv', dtype="object")
             column_a column_b
0  635284328055690862    45564
1                 NaN       45
2                   1      NaN
3                   1        5 

This preserves the exact value of the large integer and also allows for NaN values. However, because these are now arrays of python objects, there will be a significant performance hit for compute-intensive tasks. Furthermore, on closer examination, it appears that these are Python str objects, so we still need another conversion step. To my surprise, there was no straightforward approach. This was the best I could do:

def col_to_intNA(col):
    # Map each value to pandas.NA if missing, otherwise to an exact Python int.
    return {ix: pandas.NA if pandas.isnull(v) else int(v)
            for ix, v in col.to_dict().items()}

# `sample` is the object-dtype DataFrame produced by the read_csv call above.
sample = {col: col_to_intNA(sample[col])
          for col in sample.columns}
sample = pandas.DataFrame(sample, dtype="Int64")

This gives the desired result:

>>> sample
             column_a  column_b
0  635284328055690862     45564
1                <NA>        45
2                   1      <NA>
3                   1         5
>>> sample.dtypes
column_a    Int64
column_b    Int64
dtype: object

So that solves one problem. But a second problem arises, because to read from a Redshift database, you would normally use read_sql, which doesn't provide any way to specify data types.

So we'll roll our own! This is based on the code you posted, as well as some code from the pandas_redshift library. It uses psycopg2 directly, rather than using sqlalchemy, because I am not sure sqlalchemy provides a cursor_factory parameter that accepts a RealDictCursor. Caveat: I have not tested this at all because I am too lazy to set up a postgres database just to test a StackOverflow answer! I think it should work but I am not certain. Please let me know whether it works and/or what needs to be corrected.

import psycopg2
from psycopg2.extras import RealDictCursor  # Turn rows into proper dicts.

import pandas

def row_null_to_NA(row):
    return {col: pandas.NA if pandas.isnull(val) else val
            for col, val in row.items()}

# Note: psycopg2.connect expects a libpq-style DSN, not a SQLAlchemy URL.
connstr = 'host=<cluster_name> port=<port> dbname=<db_name> user=<username> password=<password>'

conn = psycopg2.connect(connstr, cursor_factory=RealDictCursor)
try:  # `with conn:` only closes the transaction, not the connection
    cursor = conn.cursor()
    cursor.execute('''select * from schema.table_name''')

    # The DataFrame constructor accepts generators of dictionary rows.
    df = pandas.DataFrame(
        (row_null_to_NA(row) for row in cursor.fetchall()), 
        dtype="Int64"
    )
finally:
    conn.close()

print(df)

Note that this assumes that all your columns are integer columns. You might need to load the data column-by-column if not.
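If only some of your columns are bigint, one untested variation is to build the frame as plain objects first and convert just those columns afterwards (the column names below are placeholders):

bigint_cols = ['column_a', 'column_b']  # placeholders for your actual bigint columns

df = pandas.DataFrame(
    (row_null_to_NA(row) for row in cursor.fetchall()),
    dtype="object",  # keep raw Python objects; nothing passes through float here
)
for col in bigint_cols:
    # Rebuild each bigint column as a nullable Int64 array from exact Python ints.
    df[col] = pandas.array(
        [pandas.NA if pandas.isnull(v) else int(v) for v in df[col]],
        dtype="Int64",
    )

The remaining columns stay as objects, so convert those to whatever dtypes make sense separately.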

senderle
  • that parameter `dtype=object/int64` is available only in `pd.read_csv()`. I am reading data from Redshift. @senderle – Xyz Sep 16 '20 at 15:13
  • @LoneWalker interesting — can you point to the code or library you're using to read from Redshift? Is it [this](https://github.com/agawronski/pandas_redshift)? – senderle Sep 16 '20 at 15:30
  • I am using something like [this](https://stackoverflow.com/a/40336183/7592515). Using `sqlalchemy` library and `redshift+psycopg2` driver to connect to redshift and reading table data from there. @senderle – Xyz Sep 16 '20 at 20:08
  • I am using the code below to read from the Redshift table, in which I have one bigint (int8) column holding that kind of data. Another observation I would like to share: whenever I have numbers longer than 16 digits and `NULL` values, I face this issue. `from sqlalchemy import create_engine import pandas as pd connstr = 'redshift+psycopg2://:@/' engine = create_engine(connstr) with engine.connect() as conn, conn.begin(): df = pd.read_sql('''select * from schema.table_name''', conn) print(df)` – Xyz Sep 16 '20 at 20:19
  • @LoneWalker I will give this some more thought but I see that `read_sql` does not provide this kind of flexibility. It seems like the best fix will be to load the data in a different format and then convert to a DataFrame manually. I wonder if there's a way to get the result of the SQL query in JSON and then process that. – senderle Sep 17 '20 at 15:18
  • @LoneWalker if you get a chance, please let me know whether the code I posted at the bottom of this answer works as expected. – senderle Sep 17 '20 at 19:20
1

One possible fix: instead of doing select * from schema.table_name, you can list all the columns explicitly and cast the particular column in SQL.

Let's say you have 5 columns in the table and col2 is the bigint (int8) column. Then you can read it like below:

from sqlalchemy import create_engine 
import pandas as pd  
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>' 
engine = create_engine(connstr) 
with engine.connect() as conn, conn.begin():     
    df = pd.read_sql('''select col1, cast(col2 as int), col3, col4, col5... from schema.table_name''', conn)
print(df)

P.S.: I am not sure this is the smartest solution, but logically, if Python is not able to cast to int64 properly, then we can read the already-cast value from SQL itself.

Further, I would like to try casting integer columns dynamically whenever their length is more than 17 digits.
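A rough, untested sketch of that idea: since casting to int cannot widen a bigint (and may overflow for very large values), one variant is to cast the bigint column to varchar in SQL so every digit arrives intact, and then convert it back to a nullable integer in pandas. The connection string and column names are the same placeholders as above:

from sqlalchemy import create_engine
import pandas as pd

connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)

with engine.connect() as conn, conn.begin():
    # Cast the bigint column to varchar so its digits survive the trip as a string.
    df = pd.read_sql(
        '''select col1, cast(col2 as varchar) as col2, col3, col4, col5
           from schema.table_name''',
        conn,
    )

# Convert the string column back to a nullable integer column without going
# through float (NULLs come back as None, so handle them explicitly).
df['col2'] = pd.array(
    [pd.NA if pd.isnull(v) else int(v) for v in df['col2']],
    dtype='Int64',
)
print(df)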

Rajnish kumar