df:
[Row(split(value,,)=[u'21.0', u'1', u'2']), Row(split(value,,)=[u'22.0', u'3', u'4'])]

How can I convert each row of df into a LabeledPoint object (a label plus features), where the first value in each row is the label and the remaining two values are the features?

My code:

df.map(lambda row: LabeledPoint(row[0], row[1:]))

It does not seem to work. I am new to Spark, so any suggestions would be helpful.

data_person
  • Possible duplicate of [error in labelled point object pyspark](http://stackoverflow.com/questions/38887157/error-in-labelled-point-object-pyspark) –  Aug 11 '16 at 22:07
  • @LostInOverflow no it is not, this is from dataframe and that was from RDD. – data_person Aug 11 '16 at 22:16
  • @LostInOverflow can you suggest something for this? – data_person Aug 11 '16 at 22:17
  • Have you looked at this question? [stackoverflow question about spark/labelledpoint](http://stackoverflow.com/questions/32556178/create-labeledpoints-from-spark-dataframe-in-python) – itza Aug 11 '16 at 23:21

1 Answer


If you want to obtain an RDD of LabeledPoints, you need a function that parses each array of strings.

from pyspark.mllib.regression import LabeledPoint

a = sc.parallelize([([u'21.0', u'1', u'2'],),
                    ([u'22.0', u'3', u'4'],)]).toDF(["value"])
a.printSchema()

#root
#|-- value: array (nullable = true)
#|    |-- element: string (containsNull = true)

To achieve this, use a parsing function like the following.

def parse(l):
    # convert each string to a float; the first element is the label,
    # the rest are the features
    l = [float(x) for x in l]
    return LabeledPoint(l[0], l[1:])
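The plain-Python part of parse can be checked without a Spark session (a minimal sketch; LabeledPoint itself requires pyspark, so only the string-to-float split is exercised here):

```python
# The label/feature split that parse performs, shown on one sample row.
# LabeledPoint itself comes from pyspark.mllib.regression; this sketch
# only exercises the plain-Python conversion, so it runs without Spark.
row = [u'21.0', u'1', u'2']
values = [float(x) for x in row]
label, features = values[0], values[1:]
print(label, features)  # 21.0 [1.0, 2.0]
```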

After defining that function, map it over the DataFrame's underlying RDD (in Spark 1.x, DataFrame.map delegates to the RDD; in Spark 2.x you must call a.rdd.map explicitly).

a.map(lambda l: parse(l[0])).take(2)

# [LabeledPoint(21.0, [1.0,2.0]), LabeledPoint(22.0, [3.0,4.0])]

Here you can find the published notebook where I tested everything.

P.S.: If you call toDF on the resulting RDD, you will obtain a DataFrame with two columns (features and label).

Alberto Bonsanto