I have fitted a k-means algorithm on 5000+ samples after converting the text to vectors with TF-IDF. I want to label the 5 points nearest to the centroid of each of the 15 clusters formed. I have the labels in a separate dataframe, but I do not want to use all of them. How do I look up the exact (15*5) indices of the 5000 samples that need to be marked, based on the indices of that separate labels dataframe? (A rough sketch of the lookup I have in mind is after the code below.)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X_train_df = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv")

# clean_text is my own preprocessing function; X_train_df has a 'review' column
X_train_df['cleaned_text'] = X_train_df['review'].apply(clean_text)

dX_train, dX_test, dy_train, dy_test = train_test_split(
    X_train_df["cleaned_text"], y_train, test_size=0.2)

# Vectorize the text before clustering
vectorizer = TfidfVectorizer(max_features=10000)
X_train_vectorized = vectorizer.fit_transform(dX_train)
X_test_vectorized = vectorizer.transform(dX_test)

# Cluster and get each sample's distance to every centroid
k = 20
kmeans = KMeans(n_clusters=k)
X_digits_dist = kmeans.fit_transform(X_train_vectorized)

# Positional index (within the training split) of the single closest sample per cluster
representative_digit_idx = np.argmin(X_digits_dist, axis=0)

# Positional indices of the 3 closest samples per cluster
nearest_indices = np.argsort(X_digits_dist, axis=0)[:3]
nearest_indices_flattened = nearest_indices.flatten()

X_representative_digits = X_train_vectorized[nearest_indices_flattened]
# flatten to 1-D so sklearn doesn't warn about a column-vector y
y_representative_digits = dy_train.iloc[nearest_indices_flattened].values.ravel()

log_reg = RandomForestClassifier(n_estimators=1000)
log_reg = log_reg.fit(X_representative_digits, y_representative_digits)
new_scr = log_reg.score(X_test_vectorized, dy_test)
print(f'Accuracy with only {len(nearest_indices_flattened)} representative training examples: {new_scr:.2%}')
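Concretely, the lookup I think I need is something like the sketch below. It assumes y_train.csv is row-aligned with X_train.csv (so the two dataframes share the same index) and relies on dX_train keeping its original index after train_test_split:

# nearest_indices_flattened holds positional indices into the training split,
# so translate them to the original dataframe index, then into the labels dataframe
original_index = dX_train.index[nearest_indices_flattened]   # index labels in X_train_df
representative_labels = y_train.loc[original_index]          # assumes y_train shares that index

# equivalently, since dy_train came out of the same train_test_split call,
# a purely positional lookup inside the split also works:
representative_labels_alt = dy_train.iloc[nearest_indices_flattened]

With k = 20 clusters and 3 points per cluster this picks out 60 rows; taking 5 points from each of 15 clusters would give the (15*5) rows mentioned above.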
I tried taking the first 50 labels for the first 50 vectorized samples, and that gave me around 60% accuracy. But when I tried using the clustering method to pick the labels for training my model, performance dropped. It's a classification dataset with balanced classes, so I'm not sure of the reason.
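For reference, the "first 50 labels" baseline I compared against was roughly this (a sketch using the variable names from the code above, not my exact run):

# baseline: train on the first 50 rows of the vectorized training split and their labels
baseline_clf = RandomForestClassifier(n_estimators=1000)
baseline_clf.fit(X_train_vectorized[:50], dy_train.iloc[:50].values.ravel())
print(f'Baseline accuracy with the first 50 examples: {baseline_clf.score(X_test_vectorized, dy_test):.2%}')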