I am trying to visualize word2vec word vectors using PySpark's PCA, but I'm getting an unhelpful error message. It says the column features is of the wrong type, yet the expected type and the actual type it prints are identical. (Full message below.)

Background

spark-2.4.0-bin-hadoop2.7

Scala 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).

Python 3.6.5 |Anaconda, Inc.

Ubuntu 16.04

My Code

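import numpy as np
# sc, sqlContext, Vectors and PCA are set up earlier in the session (imports not shown)

maxWordsVis = 15

Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')
words = np.load('Gab_ai_posts_WordList.npy')
# to RDD; avoid this with big matrices by reading them directly from HDFS
Feat = sc.parallelize(Feat)
# wrap each row in a one-element tuple so it becomes a single-column record
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))
# to DataFrame with one vector column named "features"
dfFeat = sqlContext.createDataFrame(Feat, ["features"])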

dfFeat.head()

Row(features=DenseVector([-0.1282, 0.0699, -0.0891, -0.0437, -0.0915, -0.0557, 0.1432, -0.1564, 0.0058, -0.0603, 0.1383, -0.0359, -0.0306, -0.0415, -0.0191, 0.058, 0.0119, -0.0302, 0.0362, -0.0466, 0.0403, -0.1035, 0.0456, 0.0892, 0.0548, -0.0735, 0.1094, -0.0299, -0.0549, -0.1235, 0.0062, 0.1381, -0.0082, 0.085, -0.0083, -0.0346, -0.0226, -0.0084, -0.0463, -0.0448, 0.0285, -0.0013, 0.0343, -0.0056, 0.0756, -0.0068, 0.0562, 0.0638, 0.023, -0.0224, -0.0228, 0.0281, -0.0698, -0.0044, 0.0395, -0.021, 0.0228, 0.0666, 0.0362, 0.0116, -0.0088, 0.0949, 0.0265, -0.0293, -0.007, -0.0746, 0.0891, 0.0145, 0.0532, -0.0084, -0.0853, 0.0037, -0.055, -0.0706, -0.0296, 0.0321, 0.0495, -0.0776, -0.1339, -0.065, 0.0856, 0.0328, 0.0821, 0.036, -0.0179, -0.0006, -0.036, 0.0438, -0.0077, -0.0012, 0.0322, 0.0354, 0.0513, 0.0436, 0.0002, -0.0578, 0.1062, 0.019, 0.0346, -0.1261]))

numComponents = 3
pca = PCA(k=numComponents, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(dfFeat)  # the fit call is where the error below is raised

Error Message

Py4JJavaError: An error occurred while calling o4583.fit. : java.lang.IllegalArgumentException: requirement failed: 
Column features must be of type 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.  
     at scala.Predef$.require(Predef.scala:224)
Gabriel Fair