
I am using the script below to read data from MSSQL Server into a PySpark DataFrame.

```python
DFFSA = (
    spark.read.format("jdbc")
    .option("url", jdbcURLDev)
    .option("driver", MSSQLDriver)
    .option("dbtable", "FSA.dbo.FSA")
    .option("user", "DevUser")
    .option("password", "password")
    .load()
)
```

This gives me a PySpark DataFrame. How can I get the data into a pandas DataFrame instead? I know I can convert the result with `toPandas()`, but that takes a long time since I am reading millions of rows.
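One way to skip Spark entirely is to read the table straight into pandas with `pandas.read_sql`, passing `chunksize` so the rows arrive as an iterator of smaller DataFrames rather than one huge fetch. A minimal sketch of that pattern is below; for MSSQL you would create the connection with `pyodbc` (the commented-out driver/server names are placeholders, not from the question), but an in-memory SQLite database is used here so the example is self-contained:

```python
import sqlite3

import pandas as pd

# For MSSQL you would instead connect with pyodbc, e.g.:
# import pyodbc
# conn = pyodbc.connect(
#     "DRIVER={ODBC Driver 17 for SQL Server};"
#     "SERVER=DevServer;DATABASE=FSA;UID=DevUser;PWD=password"
# )
# (driver and server names above are illustrative placeholders).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FSA (id INTEGER, val TEXT);
    INSERT INTO FSA VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

# chunksize makes read_sql return an iterator of DataFrames,
# so millions of rows are never held in memory all at once.
chunks = pd.read_sql("SELECT * FROM FSA", conn, chunksize=2)
df = pd.concat(chunks, ignore_index=True)
print(len(df))  # 3
```

Note this runs on a single machine: it avoids the Spark-to-pandas conversion cost, but it gives up Spark's distributed read, so it only makes sense when the data fits (in chunks) on one node.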

pault
Ahmad
  • Your pandas read might not be any faster. What is the size of the data you are looking at? – usernamenotfound Mar 21 '18 at 16:18
  • @cᴏʟᴅsᴘᴇᴇᴅ this is actually not a dupe of that question. He's asking how to convert a pyspark dataframe to pandas, which can be done using the `toPandas()` method. A better dupe candidate would be: https://stackoverflow.com/questions/34817549/how-to-convert-spark-rdd-to-pandas-dataframe-in-ipython – pault Mar 21 '18 at 19:41
  • 1
    @pault Thanks a bunch, edited the duplicates list. – cs95 Mar 21 '18 at 19:42
  • `toPandas()` will be slow for millions of rows. There's no way around it because all of your data has to be serialized through the head node. A better question is why do you need it in pandas? You should think about how to do your processing in spark instead. – pault Mar 21 '18 at 19:42
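If `toPandas()` is unavoidable, enabling Apache Arrow can speed the conversion up considerably by transferring data in columnar batches instead of row by row. This is a config fragment, not a full program; the property name depends on the Spark version (an assumption here is Spark 2.3 or later, where Arrow support was introduced), and the data is still collected to the driver, so it improves throughput but not the fundamental limit described in the comment above:

```python
# Spark 2.3–2.4 property name:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Spark 3.x renamed it:
# spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# All rows are still pulled to the driver, just much faster:
pdf = DFFSA.toPandas()
```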

0 Answers