What .map() function in Python do I use to create a set of LabeledPoints from a Spark DataFrame? What is the notation if the label/outcome is not the first column but I can refer to it by its column name, 'status'?
I create a pandas DataFrame for each input line with this .map() function:
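import pandas as pd

def parsePoint(line):
    # Split the tab-separated line; the first field is skipped
    listmp = list(line.split('\t'))
    # One-hot encode the remaining fields and collapse them into a single row of counts
    dataframe = pd.DataFrame(pd.get_dummies(listmp[1:]).sum()).transpose()
    # The label column 'status' is a copy of the 'accepted' indicator
    dataframe.insert(0, 'status', dataframe['accepted'])
    # Drop columns that are not features
    if 'NULL' in dataframe.columns:
        dataframe = dataframe.drop('NULL', axis=1)
    if '' in dataframe.columns:
        dataframe = dataframe.drop('', axis=1)
    if 'rejected' in dataframe.columns:
        dataframe = dataframe.drop('rejected', axis=1)
    if 'accepted' in dataframe.columns:
        dataframe = dataframe.drop('accepted', axis=1)
    return dataframe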
I convert the result to a Spark DataFrame after the reduce function has recombined all the pandas DataFrames.
parsedData = sqlContext.createDataFrame(parsedData)
But now how do I create LabeledPoints from this in Python? I assume it may be another .map() function?
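To make the question concrete, here is a minimal sketch of what such a second .map() might look like. It is not a confirmed solution. It assumes every column other than 'status' is numeric and should become a feature, in column order. The helper name `to_labeled` and its `make_point` parameter are hypothetical, introduced only for illustration; the Spark calls in the trailing comments (`DataFrame.rdd`, `Row.asDict()`, `pyspark.mllib.regression.LabeledPoint`) are shown as assumed usage and are not executed here.

```python
def to_labeled(row, label_col="status", make_point=None):
    """Split one row into (label, features), pulling the label out by column name.

    `row` may be a plain dict or any object with an asDict() method (as a Spark
    Row has). `make_point` is a hypothetical hook: pass LabeledPoint to get
    LabeledPoint objects, or leave it as None to get plain tuples.
    """
    fields = row.asDict() if hasattr(row, "asDict") else dict(row)
    label = float(fields[label_col])
    # Every other column becomes a feature, in the row's column order.
    features = [float(v) for name, v in fields.items() if name != label_col]
    if make_point is None:
        return label, features
    return make_point(label, features)


# Plain-Python check with a dict standing in for one row:
example = to_labeled({"a": 2, "status": 1, "b": 0})

# Assumed Spark usage (not run here; requires pyspark):
# from pyspark.mllib.regression import LabeledPoint
# labeledPoints = parsedData.rdd.map(lambda r: to_labeled(r, "status", LabeledPoint))
```

Because the label is looked up by name, its position among the columns should not matter in this sketch.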