How can I convert Spark dataframe column to Numpy array efficiently?

Question

I have a Spark DataFrame with around 1 million rows. I am using PySpark and need to apply the Box-Cox transformation from the SciPy library to each column of the DataFrame. However, the Box-Cox function accepts only a 1-D NumPy array as input. How can I do this efficiently?

Is a NumPy array distributed across Spark, or are all its elements collected on the single node where the driver program is running?

Suppose df is my DataFrame with a column C1; I want to perform an operation similar to this:

stats.boxcox(df.select("C1"))

There is pretty much no case where you can benefit from having a Spark DataFrame and being able to process individual columns using NumPy. Basically, either your data is small enough (cleaned, aggregated) that you can process it locally, for example by converting to Pandas, or you need a method that can work on distributed data, which is not something that can typically be done with NumPy alone. – zero323, Jul 11 '16 at 23:04
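The local route mentioned in the comment above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name column_to_numpy is hypothetical, and it assumes that df is a pyspark.sql.DataFrame and that the selected column fits in the driver's memory. The resulting array lives only on the driver; it is not distributed.

```python
def column_to_numpy(df, col_name):
    """Collect one column of a Spark DataFrame as a 1-D NumPy array.

    Assumption: `df` is a pyspark.sql.DataFrame and the column is small
    enough to fit in memory on the driver node.
    """
    # select() narrows the DataFrame to one column on the cluster;
    # toPandas() then pulls that column to the driver as a pandas DataFrame.
    pdf = df.select(col_name).toPandas()
    # Extract the pandas Series and convert it to a 1-D NumPy array.
    return pdf[col_name].to_numpy()


# Hypothetical usage, with `df` being the DataFrame from the question:
# from scipy import stats
# x = column_to_numpy(df, "C1")
# transformed, lam = stats.boxcox(x)  # x must be strictly positive
```

Selecting the column before calling toPandas() avoids transferring the other columns to the driver.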

score 0 · Answer 1 · edited May 23 '17 at 11:51


DataFrames and RDDs in Spark let you abstract away from how the processing is distributed.

To do what you require, I think a UDF can be very useful. Here you can see an example of its use:

Functions from Python packages for udf() of Spark dataframe

answered Jul 10 '16 at 18:42 by Josemy (edited May 23 '17 at 11:51 by Community)


Thanks for the reply. I have to apply the following function from the SciPy library, which accepts only an ndarray as input, not a single element: stats.boxcox(x), where x is a 1-D NumPy array. – Sandeep Veerlapati Jul 11 '16 at 05:02
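On the point raised in this comment: stats.boxcox needs the whole array only when it has to estimate the lambda parameter. Once lambda is fixed, the transform acts on each value independently, which is what a row-wise UDF would require. The sketch below illustrates this with synthetic data standing in for a collected column; the data and variable names are assumptions, not part of the original question.

```python
import numpy as np
from scipy import special, stats

# Synthetic, strictly positive data standing in for a collected column.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With no lambda given, stats.boxcox estimates it from the whole array
# and returns (transformed values, estimated lambda).
transformed, lam = stats.boxcox(x)

# With lambda fixed, special.boxcox is an elementwise function: each value
# can be transformed independently of the others.
elementwise = special.boxcox(x, lam)
single = special.boxcox(x[0], lam)

# Both routes should agree once the same lambda is used.
same = np.allclose(transformed, elementwise)
```

One possible design that follows from this is to estimate lambda on the driver from a collected column or a sample of it, then apply the elementwise form inside a UDF. Whether a sample gives a good enough estimate of lambda depends on the data and would need checking.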

score 0 · Answer 2 · answered Feb 27 '18 at 06:25

I have a workaround that solves the issue, but I am not sure it is the optimal solution in terms of performance, since you are switching between PySpark and pandas DataFrames:

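import pandas as pd
from scipy import stats

# Collect the whole Spark DataFrame to the driver as a pandas DataFrame.
dfpd = df.toPandas()
colName = 'YOUR_COLUMN_NAME'
colBCT_Name = colName + '_BCT'
print(colBCT_Name)
maxVal = dfpd[colName].max()
minVal = dfpd[colName].min()
print(maxVal)
print(minVal)

# Shift the column so every value is >= 1; Box-Cox requires positive input.
col_bct, l = stats.boxcox(dfpd[colName] - minVal + 1)
# Rescale the transformed values.
col_bct = col_bct * l / ((maxVal + 1)**l - 1)
col_bct = pd.Series(col_bct)
dfpd[colBCT_Name] = col_bct
# Convert the result back to a Spark DataFrame.
df = sqlContext.createDataFrame(dfpd)
df.show(2)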
