df = spark.table("data").limit(100)
df = df.toPandas()

This conversion using .toPandas() works just fine because the limit keeps df to just a few rows. If I get rid of the limit and call toPandas() on the whole df, I get the error "Job aborted due to stage failure".

I've been using .pandas_api(), and it's been working just fine, but I can't use the result with sklearn functions. I tried passing a column into fit_transform and I get the error: "The method pd.Series.__iter__() is not implemented."

If I limit the dataset and use toPandas, then fit_transform works just fine.

How can I make this work?

I've tried this and it works:

df = spark.table("data").limit(100)
df = df.toPandas()
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
encoder = LabelEncoder()
df["p"] = encoder.fit_transform(df["p"])

Removing the limit won't let me convert to pandas. Instead, I tried the API:

df = df.pandas_api()

and it converts, but then I can't pass the column into fit_transform.

1 Answer


I'll try my best to answer your question, but I'm not sure there's a solution - here's why:

The toPandas() call converts your Spark DataFrame into a pandas DataFrame. This operation collects all of your data onto the driver and holds it in a single pandas DataFrame in memory, which can make your job run out of memory - hence your "Job aborted due to stage failure". Although the error is not explicit, it is usually a sign that something went wrong memory-wise. There are other possible causes, but this one seems most likely, given that your code works when you call .limit (which in your case caps the DataFrame at 100 rows).

OK, so .toPandas() returns a pandas DataFrame. But what does .pandas_api() do? Well, in the context of scikit-learn, not exactly what you'd want. As the .pandas_api() documentation notes, it returns a pandas-on-Spark DataFrame, while scikit-learn expects a plain pandas DataFrame or a NumPy array. That is why fit_transform fails: scikit-learn tries to iterate over the column, and pandas-on-Spark does not implement that, hence "The method pd.Series.__iter__() is not implemented."
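To make the mismatch concrete, here is a hypothetical snippet (the column name "p" is taken from your question, and spark is assumed to be an active SparkSession):

df_ps = spark.table("data").pandas_api()
type(df_ps)         # pyspark.pandas.frame.DataFrame, not pandas.core.frame.DataFrame
type(df_ps["p"])    # pyspark.pandas.series.Series

# scikit-learn iterates over the column it is given, which pandas-on-Spark does
# not support - hence the __iter__ error. You could collect the column explicitly
# with df_ps["p"].to_numpy(), but that pulls everything onto the driver and
# brings back the original memory problem.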

Having said this, one option is to split your large Spark DataFrame into multiple pandas DataFrames using the limit call - i.e. try getting half of your data and see if it fails; if it does, try 25%, and so on. You can save the resulting pandas DataFrames as CSV files, read each CSV back into pandas in a for loop, and apply the same LabelEncoder to each chunk. Or you could save the partial pandas DataFrames as NumPy arrays in pickle files and then concatenate the matrices to reconstruct your data. A sketch of the chunked approach follows.
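Here is a minimal sketch of that idea, under a couple of assumptions: the column to encode is "p" (as in your question), spark is an active SparkSession, and I use randomSplit instead of repeated limit calls so that the pieces are disjoint. The encoder is fitted once on the distinct values of "p" so every chunk gets the same label mapping:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

sdf = spark.table("data")

# Fit the encoder once, on the distinct values of "p", so that all chunks
# share the same label mapping.
distinct_p = [row["p"] for row in sdf.select("p").distinct().collect()]
encoder = LabelEncoder()
encoder.fit(distinct_p)

# Split the Spark DataFrame into disjoint pieces, convert each piece to
# pandas, transform it, and stitch the results back together.
chunks = []
for piece in sdf.randomSplit([0.25, 0.25, 0.25, 0.25]):
    pdf = piece.toPandas()
    pdf["p"] = encoder.transform(pdf["p"])
    chunks.append(pdf)

df = pd.concat(chunks, ignore_index=True)

If even the concatenated result does not fit in memory, write each transformed chunk to disk (CSV or pickle) inside the loop instead, and reload them later as described above.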

Not a perfect solution, but it should get you on your way to something that works for you.
