PCA "could not convert string to float"

Question

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

data = pd.read_csv("tfidf_smogon.csv")
data.drop(['Categoría'], axis=1, inplace=True)
data.drop(data.columns[0], axis=1, inplace=True)
print(data)

pca = PCA(n_components=3)
pca.fit(data)
print('PCA fitted')
x_pca = pca.transform(data)

# One column name per component: n_components=3 yields exactly 3 columns
miLista = ['PCA1', 'PCA2', 'PCA3']
tablaPCA = pd.DataFrame(data=x_pca, columns=miLista)
print(tablaPCA)

# Now cluster the comments based on these 3 principal components
km = KMeans(n_clusters=2, n_init=100)
lista_de_cluster = km.fit_predict(tablaPCA)
print(lista_de_cluster)

tablaPCA["Cluster"]=lista_de_cluster
print(tablaPCA)
tablaPCA.to_csv("PCA_smogon.csv")

More specifically, the error occurs at the start of the PCA code, on the line "pca.fit(data)": when I run it, it raises "could not convert string to float".

With pca.fit and pca.transform I expected the code to run and give me the transformed matrix, but it says "could not convert string to float". I also tried the label function, but that didn't work either, and then fit_transform, with the same result. I don't know what to do, because I have been following some guide files.

You can't pass strings to sklearn's PCA and your column has strings. — Celius Stingher, Jun 05 '23 at 12:59
@Chris I tried it but it doesn't work, though I don't know if I did it correctly; I followed every step there. — yoryi, Jun 05 '23 at 13:03
If you did try a different approach and it's not working, either post a new question or update this one and I'll vote to reopen. As currently scoped, the question is a duplicate of the one being referred to. — Celius Stingher, Jun 05 '23 at 13:05

score 0 · Answer 1 · answered Jun 05 '23 at 13:02

You need to convert the input string data into numerical data.

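One idea is a bag-of-words encoding with CountVectorizer, which for single-word strings like the ones below amounts to one-hot encoding.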

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

X = ["one", "two", "three"]

# Convert the text into a matrix in which each row corresponds to one string
# and each column to a unique word from the whole corpus of strings
vectorizer = CountVectorizer()
X_numerical = vectorizer.fit_transform(X).toarray()

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_numerical)
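
For a DataFrame like the one in the question, it can help to first check which columns actually hold strings before deciding how to encode or drop them. The following is only a minimal sketch under assumptions: the table and its column names (doc_id, word_1 to word_3) are made up for illustration and are not taken from tfidf_smogon.csv. It lists the non-numeric columns and then fits PCA on the numeric ones only.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for the asker's table: numeric TF-IDF-like columns
# plus one leftover text column (all names here are assumptions).
data = pd.DataFrame({
    "doc_id": ["a", "b", "c", "d"],
    "word_1": [0.1, 0.0, 0.3, 0.5],
    "word_2": [0.0, 0.2, 0.1, 0.4],
    "word_3": [0.7, 0.1, 0.0, 0.2],
})

# Columns with a non-numeric dtype are the ones PCA cannot convert to float.
text_columns = data.select_dtypes(exclude="number").columns.tolist()
print(text_columns)

# Keep only the numeric columns and run PCA on those.
numeric_data = data.select_dtypes(include="number")
pca = PCA(n_components=2)
x_pca = pca.fit_transform(numeric_data)
print(x_pca.shape)
```

Whether to drop a text column, as done here, or to encode it as in the CountVectorizer example depends on whether that column carries information you want in the components.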

Thanks, it worked. Now it shows another problem, but it's progress. — yoryi, Jun 05 '23 at 13:07
