The PySpark documentation says that pandas-on-Spark is distributed. If I create a DataFrame using pyspark.pandas.read_csv('file.csv'), how can I find the number of partitions of that pandas-on-Spark DataFrame? Is there an equivalent of df.rdd.getNumPartitions() for a pandas-on-Spark DataFrame?
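
One way to check, as a minimal sketch (assuming PySpark 3.2+, where the pandas API ships with PySpark, and that 'file.csv' exists): a pandas-on-Spark DataFrame is backed by a Spark DataFrame, and to_spark() exposes it, after which the usual RDD API applies.

```python
import pyspark.pandas as ps

# Read the CSV with the pandas-on-Spark API (file name is the one from the question).
psdf = ps.read_csv('file.csv')

# Convert to the underlying Spark DataFrame, then query its RDD's partition count.
print(psdf.to_spark().rdd.getNumPartitions())
```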
The pandas API in PySpark is an expensive operation, to the best of my knowledge. To read a CSV we can use spark.read.csv rather than df = pyspark.pandas.read_csv('file.csv'). Please let me know if I missed anything. – Avind Aug 19 '23 at 15:06
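
A minimal sketch of the alternative the comment suggests (assuming the same local 'file.csv'; the header/inferSchema options are illustrative): with the native Spark reader you get a plain Spark DataFrame, so the partition count is available directly.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the CSV with Spark's native reader instead of the pandas API.
sdf = spark.read.csv('file.csv', header=True, inferSchema=True)

# On a plain Spark DataFrame, getNumPartitions() works without any conversion.
print(sdf.rdd.getNumPartitions())
```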