
This recent blog post from Databricks, https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html, says that the only change needed to run an existing pandas program under pyspark.pandas is to replace "from pandas import read_csv" with "from pyspark.pandas import read_csv".

But that does not seem right. What about all the other (non-read_csv) references to pandas? Isn't the right approach to change "import pandas as pd" to "import pyspark.pandas as pd"? Then every other pandas reference in the existing program would point to the PySpark version of pandas.
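For concreteness, here is a minimal sketch of the two approaches. The file name sales.csv is a made-up placeholder, and pyspark.pandas requires Spark 3.2 or later:

    # Blog-post approach: swap only the read_csv import.
    from pyspark.pandas import read_csv
    df = read_csv("sales.csv")  # hypothetical file

    # Approach proposed above: swap the module alias instead, so every
    # existing pd.* reference resolves to the pandas API on Spark.
    import pyspark.pandas as pd
    df = pd.read_csv("sales.csv")
    print(df.head())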

– Chuck Connell

1 Answer


You got that right. The canonical form they suggest, however, is: from pyspark import pandas as ps
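A minimal sketch of that import in use; the file name sales.csv is an assumption for illustration, not from the answer:

    from pyspark import pandas as ps

    psdf = ps.read_csv("sales.csv")  # returns a Spark-backed DataFrame
    print(psdf.describe())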

– figs_and_nuts
  • Thanks. I can test that. So that means going through my entire program and changing every pd to ps? Not exactly a one-line change. – Chuck Connell Oct 27 '21 at 02:42
  • That line does not work. It results in the error... ImportError: cannot import name 'pandas' from 'pyspark.pandas' (/databricks/spark/python/pyspark/pandas/__init__.py) – Chuck Connell Oct 27 '21 at 17:10
  • Sorry, my bad. Edited the answer! – figs_and_nuts Oct 28 '21 at 03:04
  • The above works. But for my code I decided that it is clearer to use pspd (PySpark pandas) instead of ps; this distinguishes pyspark.pandas from PySpark itself. – Chuck Connell Oct 30 '21 at 23:07
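For reference, a sketch of the aliasing convention from the last comment, which keeps plain pandas and the pandas API on Spark distinct; the DataFrame contents are made up for illustration:

    import pandas as pd            # plain pandas
    import pyspark.pandas as pspd  # pandas API on Spark, per the comment above

    pdf = pd.DataFrame({"x": [1, 2, 3]})
    psdf = pspd.from_pandas(pdf)   # convert a local frame to a Spark-backed one
    print(psdf.sum())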