
I am using the script below to read data from MSSQL Server into a PySpark DataFrame.

```python
DFFSA = (
    spark.read.format("jdbc")
    .option("url", jdbcURLDev)
    .option("driver", MSSQLDriver)
    .option("dbtable", "FSA.dbo.FSA")
    .option("user", "DevUser")
    .option("password", "password")
    .load()
)
```

This gives me a PySpark DataFrame. How can I get the data into a pandas DataFrame instead? I know I can convert the result with `toPandas()`, but that takes a long time since I am reading millions of rows.
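One way to skip Spark entirely is to read the table straight into pandas with `pandas.read_sql`, passing `chunksize` so the rows arrive as an iterator of smaller DataFrames rather than one huge fetch. A minimal sketch of that pattern is below; for MSSQL you would create the connection with `pyodbc` (the commented-out driver/server names are placeholders, not from the question), but an in-memory SQLite database is used here so the example is self-contained:

```python
import sqlite3

import pandas as pd

# For MSSQL you would instead connect with pyodbc, e.g.:
# import pyodbc
# conn = pyodbc.connect(
#     "DRIVER={ODBC Driver 17 for SQL Server};"
#     "SERVER=DevServer;DATABASE=FSA;UID=DevUser;PWD=password"
# )
# (driver and server names above are illustrative placeholders).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FSA (id INTEGER, val TEXT);
    INSERT INTO FSA VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

# chunksize makes read_sql return an iterator of DataFrames,
# so millions of rows are never held in memory all at once.
chunks = pd.read_sql("SELECT * FROM FSA", conn, chunksize=2)
df = pd.concat(chunks, ignore_index=True)
print(len(df))  # 3
```

Note this runs on a single machine: it avoids the Spark-to-pandas conversion cost, but it gives up Spark's distributed read, so it only makes sense when the data fits (in chunks) on one node.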

pault
Ahmad
  • Your pandas read might not be any faster. What is the size of the data you are looking at? – usernamenotfound Mar 21 '18 at 16:18
  • @cᴏʟᴅsᴘᴇᴇᴅ this is actually not a dupe of that question. He's asking how to convert a pyspark dataframe to pandas, which can be done using the `toPandas()` method. A better dupe candidate would be: https://stackoverflow.com/questions/34817549/how-to-convert-spark-rdd-to-pandas-dataframe-in-ipython – pault Mar 21 '18 at 19:41
  • 1
    @pault Thanks a bunch, edited the duplicates list. – cs95 Mar 21 '18 at 19:42
  • `toPandas()` will be slow for millions of rows. There's no way around it because all of your data has to be serialized through the head node. A better question is why do you need it in pandas? You should think about how to do your processing in spark instead. – pault Mar 21 '18 at 19:42
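If `toPandas()` is unavoidable, enabling Apache Arrow can speed the conversion up considerably by transferring data in columnar batches instead of row by row. This is a config fragment, not a full program; the property name depends on the Spark version (an assumption here is Spark 2.3 or later, where Arrow support was introduced), and the data is still collected to the driver, so it improves throughput but not the fundamental limit described in the comment above:

```python
# Spark 2.3–2.4 property name:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Spark 3.x renamed it:
# spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# All rows are still pulled to the driver, just much faster:
pdf = DFFSA.toPandas()
```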

0 Answers