
Is there a faster way to convert a list of pyodbc.Row objects to a pandas DataFrame? It takes about 30-40 minutes to convert a list of 10 million+ pyodbc.Row objects to a pandas DataFrame.

import pyodbc
import pandas

server = '<server_ip>'
database = '<db_name>'
username = '<db_user>'
password = '<password>'
port = '1433'  # default SQL Server port

conn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';PORT=' + port +
                      ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
cursor = conn.cursor()

# takes up to 12 minutes
rows = cursor.execute("select top 10000000 * from [LSLTGT].[MBR_DIM]").fetchall()

# Read cursor data into a pandas DataFrame.....takes forever!
df = pandas.DataFrame([tuple(t) for t in rows])
Anjana Shivangi
  • If you are able to use sqlalchemy, you could look at pandas.read_sql (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html) – meatballs Nov 26 '18 at 17:21
  • @Owen That was my previous issue. I tried using pandas.read_sql and it takes a very long time to read all the data. Please See [link](https://stackoverflow.com/questions/53382633/slow-loading-sql-server-table-into-pandas-dataframe) I am trying to find a faster way to load data from SQL server to pandas Dataframe Just Once and then I plan to store the df into feathers format for subsequent faster reads. – Anjana Shivangi Nov 26 '18 at 17:27
  • How long does that query take to execute in Management Studio? My guess is that pandas is not the problem here. – meatballs Nov 26 '18 at 17:40
  • @Owen - On SSMS it takes 8:25 minutes to read 10 million records. – Anjana Shivangi Nov 26 '18 at 18:24
  • Is SSMS running on the same machine as your python code does? – meatballs Nov 27 '18 at 09:19
  • @Owen SSMS and python code run on the same machine but the Database is stored on a remote server. I connect to it through SSMS and python code as needed. – Anjana Shivangi Nov 27 '18 at 19:39
  • @AnjanaShivangi did you find a solution to this? I've noticed this as well, using R to read in data has turned out to be significantly faster than python and I'm not sure why that is. – Ken Myers Feb 20 '19 at 00:02
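As the comments suggest, pandas.read_sql can replace the fetchall/DataFrame round-trip entirely, and its chunksize parameter streams the result set instead of materializing 10 million rows at once. Below is a minimal, runnable sketch of that pattern; an in-memory SQLite database and a made-up mbr_dim table stand in for the pyodbc connection and the real [LSLTGT].[MBR_DIM] table, but the same calls apply to the SQL Server connection:

```python
import sqlite3
import pandas

# In-memory SQLite stands in for the pyodbc SQL Server connection here;
# the table name and schema are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mbr_dim (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO mbr_dim VALUES (?, ?)",
                 [(i, "row%d" % i) for i in range(100000)])

# chunksize makes read_sql yield an iterator of DataFrames,
# so the full result set is never held twice in memory
chunks = pandas.read_sql("SELECT * FROM mbr_dim", conn, chunksize=20000)
df = pandas.concat(chunks, ignore_index=True)
```

With pyodbc the fetch itself may still dominate the runtime, as the SSMS timing in the comments hints, but this avoids the extra tuple-conversion pass in Python.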

2 Answers


You might get some improvement by using a generator expression rather than a list comprehension:

df = pandas.DataFrame((tuple(t) for t in rows)) 
meatballs

There is also an option to do this directly with pandas:

df = pandas.DataFrame.from_records(rows, columns=[col[0] for col in cursor.description])
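This pattern can be tried without a live database. In the sketch below, plain tuples stand in for the pyodbc.Row objects returned by fetchall(), and a hypothetical description list mimics cursor.description, whose entries are 7-tuples with the column name in the first position:

```python
import pandas

# Plain tuples stand in for pyodbc.Row objects from cursor.fetchall()
rows = [(1, "Alice", 30), (2, "Bob", 25)]

# cursor.description entries are 7-tuples:
# (name, type_code, display_size, internal_size, precision, scale, null_ok)
description = [("id", int, None, 10, 10, 0, False),
               ("name", str, None, 50, 50, 0, True),
               ("age", int, None, 10, 10, 0, True)]

# from_records accepts the row sequence directly; column names
# come from the first element of each description tuple
df = pandas.DataFrame.from_records(rows,
                                   columns=[col[0] for col in description])
```

This skips the per-row tuple() conversion in the question entirely, since from_records iterates the rows itself.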
ira