df:
[Row(split(value,,)=[u'21.0', u'1', u'2']), Row(split(value,,)=[u'22.0', u'3', u'4'])]

How can I convert each row of df into a LabeledPoint object (a label plus features), where the first value in each row is the label and the remaining two values are the features?

My code:

df.map(lambda row: LabeledPoint(row[0], row[1:]))

It does not seem to work. I am new to Spark, so any suggestions would be helpful.

data_person
  • Possible duplicate of [error in labelled point object pyspark](http://stackoverflow.com/questions/38887157/error-in-labelled-point-object-pyspark) –  Aug 11 '16 at 22:07
  • @LostInOverflow no it is not, this is from dataframe and that was from RDD. – data_person Aug 11 '16 at 22:16
  • @LostInOverflow can you suggest something for this? – data_person Aug 11 '16 at 22:17
  • Have you looked at this question? [stackoverflow question about spark/labelledpoint](http://stackoverflow.com/questions/32556178/create-labeledpoints-from-spark-dataframe-in-python) – itza Aug 11 '16 at 23:21

1 Answer


If you want to obtain an RDD of LabeledPoints, you need a function that parses each array of strings.

from pyspark.mllib.regression import LabeledPoint

a = sc.parallelize([([u'21.0', u'1', u'2'],),
                    ([u'22.0', u'3', u'4'],)]).toDF(["value"])
a.printSchema()

#root
#|-- value: array (nullable = true)
#|    |-- element: string (containsNull = true)

To achieve this, use a parsing function like the following.

def parse(l):
    # convert each string to a float; the first element is the label,
    # the rest are the features
    l = [float(x) for x in l]
    return LabeledPoint(l[0], l[1:])
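The plain-Python part of parse can be checked without a Spark session (a minimal sketch; LabeledPoint itself requires pyspark, so only the string-to-float split is exercised here):

```python
# The label/feature split that parse performs, shown on one sample row.
# LabeledPoint itself comes from pyspark.mllib.regression; this sketch
# only exercises the plain-Python conversion, so it runs without Spark.
row = [u'21.0', u'1', u'2']
values = [float(x) for x in row]
label, features = values[0], values[1:]
print(label, features)  # 21.0 [1.0, 2.0]
```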

After defining that function, map it over the DataFrame's underlying RDD (in Spark 1.x, DataFrame.map delegates to the RDD; in Spark 2.x you must call a.rdd.map explicitly).

a.map(lambda l: parse(l[0])).take(2)

# [LabeledPoint(21.0, [1.0,2.0]), LabeledPoint(22.0, [3.0,4.0])]

Here you can find the published notebook where I tested everything.

P.S.: If you call toDF on the resulting RDD, you will obtain a DataFrame with two columns (features and label).

Alberto Bonsanto