
I'm trying to reduce the number of features of a dataset of images so that the cosine similarity computation runs faster.

I have a pandas DataFrame with the columns ["url", "cluster_id", "features"] that contains 81 rows.

I would like to apply sklearn PCA to the "features" column, which contains a DenseVector (2048 elements, to be exact) for each row.

The problem is that when I apply

from sklearn.decomposition import PCA as skPCA

pca = skPCA(n_components=1024)
pca_pd = pca.fit(list(test_pd["features"].values))

I actually reduce the number of rows and not the number of features for each row.

#Output
pca.components_
array([[-0.0232138 ,  0.01177754, -0.0022028 , ...,  0.00181739,
         0.00500531,  0.00900601],
       [ 0.02912731,  0.01187949,  0.00375974, ..., -0.00153819,
         0.0025645 ,  0.0210677 ],
       [ 0.00099789,  0.02129508,  0.00229157, ..., -0.0045913 ,
         0.00239336, -0.01231318],
       [-0.00134043,  0.01609966,  0.00277412, ..., -0.00944288,
         0.00907663, -0.04781827],
       [-0.01286403,  0.00666523, -0.00318833, ...,  0.00101012,
         0.0045756 , -0.0043937 ]])

Do you have an idea how to solve this problem?

Copp

1 Answer


I think it is better not to use a list, but a DataFrame or a NumPy array. If I'm not wrong, DenseVector is a data type from Spark.

To convert it: densevector.toArray()
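
For example, something along these lines (just a sketch, assuming the "features" column holds pyspark.ml.linalg.DenseVector objects and the DataFrame is called test_pd as in your question):

import numpy as np

# Convert each DenseVector to a NumPy array and stack them into an
# (n_samples, n_features) matrix -- (81, 2048) for your data.
features = np.stack([v.toArray() for v in test_pd["features"].values])
print(features.shape)  # (81, 2048)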

If you are using scikit-learn PCA, you should also do a transform, not only a fit.

Like pca.fit_transform(array)
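
Putting it together, a minimal sketch (the n_components value of 64 and the "features_reduced" column name are just placeholders; PCA can give you at most min(n_samples, n_features) components, so with 81 rows you cannot ask for 1024):

from sklearn.decomposition import PCA

# Fit PCA on the (81, 2048) matrix and project the rows onto the
# first components in one step.
pca = PCA(n_components=64)             # must be <= min(81, 2048) = 81
reduced = pca.fit_transform(features)  # reduced.shape == (81, 64)
test_pd["features_reduced"] = list(reduced)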

BCJuan
  • That's right, I'm using pyspark. Since I compute the cosine similarity on each cluster, and the clusters are not that big (~200 items per cluster), I thought it would make more sense to use sklearn PCA instead of pyspark PCA. Thank you for your answer, I'm going to try that. – Copp Apr 17 '19 at 08:53