This recent blog post from Databricks (https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html) says that the only change needed to a pandas program to run it under `pyspark.pandas` is to change `from pandas import read_csv` to `from pyspark.pandas import read_csv`.
But that does not seem right. What about all the other references to pandas in the program, not just `read_csv`? Isn't the right approach to change `import pandas as pd` to `import pyspark.pandas as pd`? Then every other pandas reference in your existing program will point to the pyspark version of pandas.
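To make the question concrete, here is a minimal sketch of the kind of existing program I mean. It is plain pandas; the hypothesized migration is to swap only the commented import line, since everything else goes through the `pd` alias (the `pyspark.pandas` line is the assumption being asked about, not something I have verified):

```python
import pandas as pd
# Hypothesized one-line migration:
# import pyspark.pandas as pd

# Everything below refers to pandas only through the `pd` alias,
# so rebinding the alias would redirect all of these calls at once.
df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
total = df["x"].sum()
print(total)
```

If the blog's `from pandas import read_csv` approach were used instead, only `read_csv` itself would be redirected, and direct uses of `pd.DataFrame`, `pd.concat`, etc. would still hit classic pandas.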