
This is my first time with PySpark (Spark 2), and I'm trying to create a toy DataFrame for a logit model. I ran the tutorial successfully and would like to pass my own data into it.

I've tried this:

%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df,["label", "features"])

but I cannot get rid of:

TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector

I'm using the ML library's vectors and the input is a double array, so what's the catch? It should be fine according to the documentation.

Many thanks.

asked by Jan Sila, edited by desertnaut

3 Answers


From NumPy to pandas to Spark:

import numpy as np
import pandas as pd

data = np.random.rand(4, 4)
df = pd.DataFrame(data, columns=list('abcd'))
spark.createDataFrame(df).show()

Output:

+-------------------+-------------------+------------------+-------------------+
|                  a|                  b|                 c|                  d|
+-------------------+-------------------+------------------+-------------------+
| 0.8026427193838694|0.16867056812634307|0.2284873209015007|0.17141853164400833|
| 0.2559088794287595| 0.3896957084615589|0.3806810025185623| 0.9362280141470332|
|0.41313827425060257| 0.8087580640179158|0.5547653674054028| 0.5386190454838264|
| 0.2948395900484454| 0.4085807623354264|0.6814694724946697|0.32031773805256325|
+-------------------+-------------------+------------------+-------------------+
answered by Jeff Hernandez
  • The thing is, if you are to continue processing this data with Spark ML, you are going to need something like [`VectorAssembler`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler) further downstream, in order to convert the 4 columns into a single one, like `features` in my answer, as Spark ML needs the features in this form... – desertnaut Sep 12 '18 at 17:15
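A minimal sketch of the `VectorAssembler` step that comment mentions, continuing from the `df` in the answer above (treating `d` as the label column is purely illustrative):

from pyspark.ml.feature import VectorAssembler

sdf = spark.createDataFrame(df)  # the pandas DataFrame from the answer
assembler = VectorAssembler(inputCols=['a', 'b', 'c'], outputCol='features')
# transform() appends a 'features' vector column assembled from columns a, b, c
assembled = assembler.transform(sdf).select('features', sdf['d'].alias('label'))
assembled.show(2, truncate=False)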

You are mixing functionality from ML and MLlib, which are not necessarily compatible. You don't need a LabeledPoint when using spark-ml:

sc.version
# u'2.1.1'

import numpy as np
from pyspark.ml.linalg import Vectors

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(dff,schema=["label", "features"])

mydf.show(5)
# +-----+-------------+ 
# |label|     features| 
# +-----+-------------+ 
# |    1|[0.0,0.0,0.0]| 
# |    0|[0.0,1.0,1.0]| 
# |    0|[0.0,1.0,0.0]| 
# |    1|[0.0,0.0,1.0]| 
# |    0|[0.0,1.0,0.0]|
# +-----+-------------+
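Since the goal in the question is a logit model, here is a minimal sketch of fitting spark.ml's `LogisticRegression` on the `mydf` built above (`maxIter=10` is an arbitrary illustrative value):

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
lr_model = lr.fit(mydf)  # fit on the (label, features) DataFrame
lr_model.transform(mydf).select('label', 'prediction').show(5)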

PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]

answered by desertnaut, edited by eliasah
  • I think it's very important to clear these things up, because this is where the mess begins. It's not the first time OPs have mixed them up, and at a certain point they wonder which one they should use. – eliasah Jul 13 '17 at 08:35
  • Yeah, when you see it for the first time it gets a bit confusing :) https://www.nodalpoint.com/spark-classification/ – Jan Sila Jul 13 '17 at 09:23
  • I think you want to use `column_stack` instead of `concatenate` – Amanda Feb 22 '19 at 07:28

The problem is easy to solve: you're using the ml and the mllib APIs at the same time. Stick to one; otherwise you get this error.

This is the solution for the mllib API:

import numpy as np
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df,["label", "features"])
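If you do stick with mllib, note that its RDD-based trainers consume LabeledPoints directly rather than DataFrames; a rough sketch under that assumption, continuing from the code above:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Rebuild an RDD of LabeledPoints from the DataFrame created above
lp_rdd = mydf.rdd.map(lambda row: LabeledPoint(row.label, row.features))
lbfgs_model = LogisticRegressionWithLBFGS.train(lp_rdd, iterations=10)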

For the ml API, you don't really need LabeledPoint anymore. Here is an example. I would suggest using the ml API, since the mllib API will be deprecated soon.

answered by Dat Tran, edited by desertnaut
  • thanks a lot for your answer as well. I awarded desertnaut's and upvoted yours. Thanks a lot! – Jan Sila Jul 13 '17 at 08:13
  • upvoted too, since it is complementary to mine (`mllib`). It is not visible now, but both our answers came with just 2 minutes difference - cool... :) – desertnaut Jul 13 '17 at 08:22