df = spark.table("data").limit(100)
df = df.toPandas()
This conversion with .toPandas() works just fine, since the limited df is only a few rows. If I remove the limit and call toPandas() on the whole DataFrame, I get the error "Job aborted due to stage failure".
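For reference, the failing version is just the same two lines without the limit (a minimal sketch; "data" is the same table as above):

df = spark.table("data")
df = df.toPandas()  # collects the full table to the driver; fails with "Job aborted due to stage failure"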
I've been using .pandas_api() instead, and it's been working fine, but I can't use the result with sklearn functions. When I pass a column into fit_transform, I get the error: "The method pd.Series.__iter__() is not implemented."
If I limit the dataset and use toPandas(), then fit_transform works just fine. How can I make this work on the full dataset?
I've tried this and it works:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

df = spark.table("data").limit(100)
df = df.toPandas()  # works because only 100 rows are collected

encoder = LabelEncoder()
df["p"] = encoder.fit_transform(df["p"])
Removing the limit won't let me convert to pandas at all. Instead, I tried the pandas API:
df = df.pandas_api()
It converts, but then I can't pass a column into fit_transform.
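For completeness, the failing attempt looks roughly like this (a sketch using the same column "p" as in the working snippet; as I understand it, sklearn materializes its input as a NumPy array by iterating over it, which pandas-on-Spark Series don't support):

from sklearn.preprocessing import LabelEncoder

df = spark.table("data").pandas_api()  # pandas-on-Spark DataFrame, no limit
encoder = LabelEncoder()
df["p"] = encoder.fit_transform(df["p"])  # raises: The method pd.Series.__iter__() is not implemented.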