Py4JJava wrong columns error when calling PCA of pyspark.ml.feature

Question

I am trying to visualize word2vec word vectors using PySpark's PCA, but I'm getting an unhelpful error message. It says the "features" column is of the wrong type, yet the expected and actual types it prints are identical. (Full message below.)

Background

spark-2.4.0-bin-hadoop2.7

Scala 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).

Python 3.6.5 |Anaconda, Inc.

Ubuntu 16.04

My Code

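import numpy as np
from pyspark.ml.feature import PCA
# NOTE: the import for Vectors is not shown in the original post
# (it could come from pyspark.ml.linalg or pyspark.mllib.linalg).
# sc and sqlContext are the objects provided by the pyspark shell.

maxWordsVis = 15

Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')
words = np.load('Gab_ai_posts_WordList.npy')
# to rdd, avoid this with big matrices by reading them directly from hdfs
Feat = sc.parallelize(Feat)
# wrap each row in a one-element tuple so it becomes a single-column row
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))
# to dataframe
dfFeat = sqlContext.createDataFrame(Feat, ["features"])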

dfFeat.head()

Row(features=DenseVector([-0.1282, 0.0699, -0.0891, -0.0437, -0.0915, -0.0557, 0.1432, -0.1564, 0.0058, -0.0603, 0.1383, -0.0359, -0.0306, -0.0415, -0.0191, 0.058, 0.0119, -0.0302, 0.0362, -0.0466, 0.0403, -0.1035, 0.0456, 0.0892, 0.0548, -0.0735, 0.1094, -0.0299, -0.0549, -0.1235, 0.0062, 0.1381, -0.0082, 0.085, -0.0083, -0.0346, -0.0226, -0.0084, -0.0463, -0.0448, 0.0285, -0.0013, 0.0343, -0.0056, 0.0756, -0.0068, 0.0562, 0.0638, 0.023, -0.0224, -0.0228, 0.0281, -0.0698, -0.0044, 0.0395, -0.021, 0.0228, 0.0666, 0.0362, 0.0116, -0.0088, 0.0949, 0.0265, -0.0293, -0.007, -0.0746, 0.0891, 0.0145, 0.0532, -0.0084, -0.0853, 0.0037, -0.055, -0.0706, -0.0296, 0.0321, 0.0495, -0.0776, -0.1339, -0.065, 0.0856, 0.0328, 0.0821, 0.036, -0.0179, -0.0006, -0.036, 0.0438, -0.0077, -0.0012, 0.0322, 0.0354, 0.0513, 0.0436, 0.0002, -0.0578, 0.1062, 0.019, 0.0346, -0.1261]))

numComponents = 3
pca = PCA(k=numComponents, inputCol="features", outputCol="pcaFeatures")
# the fit call is implied by the traceback below ("calling o4583.fit")
model = pca.fit(dfFeat)

Error Message

Py4JJavaError: An error occurred while calling o4583.fit. : java.lang.IllegalArgumentException: requirement failed: 
Column features must be of type 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.  
     at scala.Predef$.require(Predef.scala:224)

Did you find a solution to this? – Shibani, Oct 29 '19 at 22:11
