
This is my first time with PySpark (Spark 2), and I'm trying to create a toy DataFrame for a logit model. I ran the tutorial successfully and would like to pass my own data into it.

I've tried this:

%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df,["label", "features"])

but I cannot get rid of:

TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector

I'm using the ML library's vectors and the input is a double array, so what's the catch? It should be fine according to the documentation.

Many thanks.

asked by Jan Sila, edited by desertnaut

3 Answers


From NumPy to pandas to Spark:

import numpy as np
import pandas as pd

data = np.random.rand(4, 4)
df = pd.DataFrame(data, columns=list('abcd'))
spark.createDataFrame(df).show()

Output:

+-------------------+-------------------+------------------+-------------------+
|                  a|                  b|                 c|                  d|
+-------------------+-------------------+------------------+-------------------+
| 0.8026427193838694|0.16867056812634307|0.2284873209015007|0.17141853164400833|
| 0.2559088794287595| 0.3896957084615589|0.3806810025185623| 0.9362280141470332|
|0.41313827425060257| 0.8087580640179158|0.5547653674054028| 0.5386190454838264|
| 0.2948395900484454| 0.4085807623354264|0.6814694724946697|0.32031773805256325|
+-------------------+-------------------+------------------+-------------------+
answered by Jeff Hernandez
  • The thing is, if you are to continue processing this data with Spark ML, you are going to need something like [`VectorAssembler`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler) further downstream, in order to convert the 4 columns into a single one, like `features` in my answer, as Spark ML needs the features in this form... – desertnaut Sep 12 '18 at 17:15
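A minimal sketch of the `VectorAssembler` step that comment mentions, continuing from the `df` in the answer above (treating `d` as the label column is purely illustrative):

from pyspark.ml.feature import VectorAssembler

sdf = spark.createDataFrame(df)  # the pandas DataFrame from the answer
assembler = VectorAssembler(inputCols=['a', 'b', 'c'], outputCol='features')
# transform() appends a 'features' vector column assembled from columns a, b, c
assembled = assembler.transform(sdf).select('features', sdf['d'].alias('label'))
assembled.show(2, truncate=False)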

You are mixing functionality from ML and MLlib, which are not necessarily compatible. You don't need a LabeledPoint when using spark-ml:

sc.version
# u'2.1.1'

import numpy as np
from pyspark.ml.linalg import Vectors

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(dff,schema=["label", "features"])

mydf.show(5)
# +-----+-------------+ 
# |label|     features| 
# +-----+-------------+ 
# |    1|[0.0,0.0,0.0]| 
# |    0|[0.0,1.0,1.0]| 
# |    0|[0.0,1.0,0.0]| 
# |    1|[0.0,0.0,1.0]| 
# |    0|[0.0,1.0,0.0]|
# +-----+-------------+
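Since the goal in the question is a logit model, here is a minimal sketch of fitting spark.ml's `LogisticRegression` on the `mydf` built above (`maxIter=10` is an arbitrary illustrative value):

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
lr_model = lr.fit(mydf)  # fit on the (label, features) DataFrame
lr_model.transform(mydf).select('label', 'prediction').show(5)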

PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]

answered by desertnaut, edited by eliasah
  • I think it's very important to clear these things up, because this is where the mess begins. It's not the first time OPs have mixed them up, and at a certain point they wonder which one they should use. – eliasah Jul 13 '17 at 08:35
  • Yeah, when you see it for the first time it gets a bit confusing :) https://www.nodalpoint.com/spark-classification/ – Jan Sila Jul 13 '17 at 09:23
  • I think you want to use `column_stack` instead of `concatenate` – Amanda Feb 22 '19 at 07:28

The problem is easy to solve: you're using the ml and the mllib APIs at the same time. Stick to one; otherwise you get this error.

This is the solution for the mllib API:

import numpy as np
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df,["label", "features"])
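If you do stick with mllib, note that its RDD-based trainers consume LabeledPoints directly rather than DataFrames; a rough sketch under that assumption, continuing from the code above:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Rebuild an RDD of LabeledPoints from the DataFrame created above
lp_rdd = mydf.rdd.map(lambda row: LabeledPoint(row.label, row.features))
lbfgs_model = LogisticRegressionWithLBFGS.train(lp_rdd, iterations=10)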

For the ml API, you don't really need LabeledPoint anymore. Here is an example. I would suggest using the ml API, since the mllib API will be deprecated soon.

answered by Dat Tran, edited by desertnaut
  • thanks a lot for your answer as well. I awarded desertnaut's and upvoted yours. Thanks a lot! – Jan Sila Jul 13 '17 at 08:13
  • upvoted too, since it is complementary to mine (`mllib`). It is not visible now, but both our answers came with just 2 minutes difference - cool... :) – desertnaut Jul 13 '17 at 08:22