I am trying to visualize word2vec words using pyspark's PCA function, but I'm getting an unhelpful error message. Saying column features are of the wrong type, but they aren't. (Full message below)
Background
spark-2.4.0-bin-hadoop2.7
Scala 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).
3.6.5 |Anaconda, Inc.
Ubuntu 16.04
My Code
maxWordsVis = 15
Feat = np.load('Gab_ai_posts_W2Vmatrix.npy')
words = np.load('Gab_ai_posts_WordList.npy')
# to rdd, avoid this with big matrices by reading them directly from hdfs
Feat = sc.parallelize(Feat)
Feat = Feat.map(lambda vec: (Vectors.dense(vec),))
# to dataframe
dfFeat = sqlContext.createDataFrame(Feat,["features"])
$dfFeat.head()
Row(features=DenseVector([-0.1282, 0.0699, -0.0891, -0.0437, -0.0915, -0.0557, 0.1432, -0.1564, 0.0058, -0.0603, 0.1383, -0.0359, -0.0306, -0.0415, -0.0191, 0.058, 0.0119, -0.0302, 0.0362, -0.0466, 0.0403, -0.1035, 0.0456, 0.0892, 0.0548, -0.0735, 0.1094, -0.0299, -0.0549, -0.1235, 0.0062, 0.1381, -0.0082, 0.085, -0.0083, -0.0346, -0.0226, -0.0084, -0.0463, -0.0448, 0.0285, -0.0013, 0.0343, -0.0056, 0.0756, -0.0068, 0.0562, 0.0638, 0.023, -0.0224, -0.0228, 0.0281, -0.0698, -0.0044, 0.0395, -0.021, 0.0228, 0.0666, 0.0362, 0.0116, -0.0088, 0.0949, 0.0265, -0.0293, -0.007, -0.0746, 0.0891, 0.0145, 0.0532, -0.0084, -0.0853, 0.0037, -0.055, -0.0706, -0.0296, 0.0321, 0.0495, -0.0776, -0.1339, -0.065, 0.0856, 0.0328, 0.0821, 0.036, -0.0179, -0.0006, -0.036, 0.0438, -0.0077, -0.0012, 0.0322, 0.0354, 0.0513, 0.0436, 0.0002, -0.0578, 0.1062, 0.019, 0.0346, -0.1261]))
numComponents = 3
pca = PCA(k = numComponents, inputCol = "features", outputCol = "pcaFeatures")
Error Message
Py4JJavaError: An error occurred while calling o4583.fit. : java.lang.IllegalArgumentException: requirement failed:
Column features must be of type
struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:224)