
I am using Pandas' read_sql to pull a data set from SQL Server into a dataframe, via pypyodbc. However, it looks like sometimes (NOT ALWAYS), when a field contains NULLs, the resulting dtype isn't int64 but float64.

I have two fields that are both declared as INT in SQL Server. One sometimes has NULLs; the other appears to always have NULLs.

Here's the schema in SQL Server:

[PLAN_ID] [int] NULL,
[DESTINATION_ID] [int] NULL,
[STORE_ID] [int] NULL,

If I use __dict__ to look at those fields, I see the following (there are other fields too, but I'm not sure how to read the __dict__ output, so I'm including the preceding line for context):

Name: plan_id, Length: 13193, dtype: int64

'destination_id':
0        None
1        None
2        None
3        None
4        None
         ...
13188    None
13189    None
13190    None
13191    None
13192    None
Name: dest_id, Length: 13193, dtype: object

'store_id':
0        175635.0
1        180942.0
2             NaN
3             NaN
4             NaN
           ...
13188     59794.0
13189    180015.0
13190     94819.0
13191    184716.0
13192    182301.0
Name: store_id, Length: 13193, dtype: float64

'version': 0

Here's the code I'm using:

import pandas as pd
import pypyodbc
from datetime import timedelta, date

start_date = date(2019, 5, 1)
end_date = date(2019, 5, 2)
daterange = pd.date_range(start_date, end_date)

con_string = ('Driver={SQL Server};'
              'Server=mysqlservername;'
              'Database=mydbname;'
              'App=PythonPull;'  # It's not "application name"!
              'Trusted_Connection=yes')
cnxn = pypyodbc.connect(con_string)


for single_date in daterange:
    datestr = single_date.strftime("%Y-%m-%d")
    print(datestr)
    tablelist = ["mytablenamehere"]
    for item in tablelist:
        query = f"""
        declare @start_date datetime = '{datestr}'
        declare @end_date   datetime  = dateadd(day,1,'{datestr}')
        SELECT id, customerid FROM mydbname.dbo.{item} with (nolock)
        where submitted >= @start_date and submitted < @end_date
        order by submitted
        """
        result_list = pd.read_sql(query, cnxn)
        # at this point, inspecting result_list.__dict__ shows that id is an int64, but customerid is a float64
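
For reference, a minimal sketch of inspecting the column types directly (it assumes the result_list DataFrame from the loop above; DataFrame.dtypes is easier to read than __dict__):

# assumes result_list is the DataFrame returned by pd.read_sql in the loop above
print(result_list.dtypes)
# id              int64
# customerid    float64   <- declared INT in SQL Server, but coerced to float64 because of NULLs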

1 Answer


Here's a neat trick using pandas 0.24.0+:

Use astype with pd.Int64Dtype, the nullable integer datatype.

MVCE:

import numpy as np
import pandas as pd

l = [1, 2, 3, np.nan]
s = pd.Series(l)

Output s:

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

s.dtype

dtype('float64')

Now, let's use astype with pd.Int64Dtype:

s = s.astype(pd.Int64Dtype())

Output s:

0      1
1      2
2      3
3    NaN
dtype: Int64

s.dtype

Int64Dtype()
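
To connect this back to the question, a minimal sketch of applying the same astype call to a read_sql-style result (the DataFrame here is a stand-in with made-up values; customerid plays the role of the INT column that comes back as float64 because of NULLs):

import numpy as np
import pandas as pd

# stand-in for the DataFrame pd.read_sql returns in the question
result_list = pd.DataFrame({'id': [1, 2, 3],
                            'customerid': [175635.0, np.nan, 180942.0]})

# convert the NULL-able INT column to the nullable integer dtype
result_list['customerid'] = result_list['customerid'].astype(pd.Int64Dtype())

print(result_list.dtypes)
# id              int64
# customerid      Int64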
  • Gig'em, and thanks! Is there any way to automatically convert any float64s to INT/BIGINT, or do I have to tell it on every field? I've got 50 tables, each with dozens of fields, and read_sql/read_sql_query doesn't appear to have any good solutions. – mbourgon Sep 16 '19 at 18:43
  • Here's a pretty cool way of doing that: `df.assign(**df.select_dtypes(include='float').astype(pd.Int64Dtype()))` Gig'em! – Scott Boston Sep 16 '19 at 19:45
  • One last one for you - is there any way to say "change any of these fields if they exist"? I just realized that I have other floats that are valid decimals, so now I'm looking at changing just the fields that need it - lookup IDs, basically. I was trying `convert_dict = {"A": Int64Dtype}` and then `df=df.astype(convert_dict)`, but it says Int64Dtype is not defined. I did get `df['A'] = df['A'].astype(pd.Int64Dtype())` to work, but if the field doesn't exist, then it barfs (just like the convert_dict). Thanks again! – mbourgon Sep 17 '19 at 21:13
  • This was very helpful. Just want to point out for others: you can also use Int32Dtype (or 8, 16) if you don't need 64 bits, since you're already converting. – Gabe Nov 20 '20 at 13:12
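
Following up on the "change any of these fields if they exist" comment above, here is a minimal sketch of converting a fixed list of lookup-ID columns only when they are actually present in the DataFrame (the column names and values are hypothetical; filtering against df.columns avoids the KeyError that astype raises for missing columns, and leaves genuine decimal columns untouched):

import numpy as np
import pandas as pd

# hypothetical lookup-ID columns that should become nullable integers when present
id_cols = ['plan_id', 'destination_id', 'store_id']

df = pd.DataFrame({'plan_id': [1.0, np.nan, 3.0],
                   'store_id': [175635.0, np.nan, 180942.0],
                   'amount': [9.99, 12.50, 3.25]})  # a real decimal column, left alone

# build the dtype mapping only for the ID columns present in this DataFrame
convert_dict = {col: pd.Int64Dtype() for col in id_cols if col in df.columns}
df = df.astype(convert_dict)

print(df.dtypes)
# plan_id     Int64
# store_id    Int64
# amount    float64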