In my following python
code I successfully can connect to MS Azure SQL Db
using ODBC connection, and can load data into an Azure SQL table using pandas' dataframe method to_sql(...). But when I use pyspark.pandas
instead, the to_sql(...)
method fails stating no such method supported. I know pandas API on Spark has reached about 97% coverage. But I was wondering if there is alternate method of achieving the same while still using ODBC.
Question: In the following code sample, how can we use ODBC
connection for pyspark.pandas for connecting to Azure SQL db and load a dataframe into a SQL table?
import sqlalchemy as sq
#import pandas as pd
import pyspark.pandas as ps
import datetime
data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
.......
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False, dtype={'OrderID' : sq.VARCHAR(10),
'Name' : sq.VARCHAR(50),
'OrderDate' : sq.DATETIME()})
Ref: Pandas API on Spark and this
UPDATE: The data file is about 6.5GB with 150 columns and 15 million records. Therefore, the pandas cannot handle it, and as expected, it gives OOM (out of memory) error.