How to use ODBC connection for pyspark.pandas

Question

In my following python code I successfully can connect to MS Azure SQL Db using ODBC connection, and can load data into an Azure SQL table using pandas' dataframe method to_sql(...). But when I use pyspark.pandas instead, the to_sql(...) method fails stating no such method supported. I know pandas API on Spark has reached about 97% coverage. But I was wondering if there is alternate method of achieving the same while still using ODBC.

Question: In the following code sample, how can we use ODBC connection for pyspark.pandas for connecting to Azure SQL db and load a dataframe into a SQL table?

import sqlalchemy as sq
#import pandas as pd
import pyspark.pandas as ps
import datetime

data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')

.......

data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False, dtype={'OrderID' : sq.VARCHAR(10), 
   'Name' : sq.VARCHAR(50),
   'OrderDate' : sq.DATETIME()})

Ref: Pandas API on Spark and this

UPDATE: The data file is about 6.5GB with 150 columns and 15 million records. Therefore, the pandas cannot handle it, and as expected, it gives OOM (out of memory) error.

It looks like `pyspark.pandas` only supports `JDBC`, [link](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/from_to_dbms.html#:~:text=You%20can%20also%20write%20it%20back%20to%20the%20stocks%20table%20as%20below%3A) Would it be possible to do `to_pandas()`, or is the file size too large? — BeRT2me, Apr 22 '22 at 04:57
@BeRT2me Good question - that prompted me to add an **UPDATE** section. — nam, Apr 23 '22 at 01:59

score 1 · Answer 1 · answered Apr 23 '22 at 02:18

I noticed you were appending the data to the table, so this work around came to mind.

Break the pyspark.pandas into chunks, and then export each chunk to pandas, and from there append the chunk.

n = len(data_df)//20 # Break it into 20 chunks
list_dfs = np.array_split(data_df, n) # [df[i:i+n] for i in range(0,df.shape[0],n)]

for df in list_dfs:
    df = df.to_pandas()
    df.to_sql()

score 0 · Answer 2 · answered Apr 22 '22 at 05:51

As per the official pyspark.pandas documentation by Apache Spark, there is no such method available for this module which can load the pandas DataFrame to SQL Table.

Please see all provided methods here.

As an alternative approach, there are some similar asks mentioned in these SO threads. This might be helpful.

How to write to a Spark SQL table from a Panda data frame using PySpark?

How can I convert a pyspark.sql.dataframe.DataFrame back to a sql table in databricks notebook

How to use ODBC connection for pyspark.pandas

2 Answers2