
I have fitted a k-means algorithm on 5000+ samples after converting them to vectors with TF-IDF. I want to label the 5 nearest points from each of the 15 clusters formed. I have the labels in a separate dataframe, but do not want to use all of them. How do I look up the exact (15*5) indices of the 5000 samples that need to be marked, based on the indices of the separate labels dataframe?

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train_df = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv")

# X_train_df is a pandas DataFrame with a 'review' column;
# clean_text is assumed to be defined elsewhere
X_train_df['cleaned_text'] = X_train_df['review'].apply(clean_text)

dX_train, dX_test, dy_train, dy_test = train_test_split(
    X_train_df['cleaned_text'], y_train, test_size=0.2)

# Vectorize before clustering: fit TF-IDF on the training split only
vectorizer = TfidfVectorizer(max_features=10000)
X_train_vectorized = vectorizer.fit_transform(dX_train)
X_test_vectorized = vectorizer.transform(dX_test)

k = 20
kmeans = KMeans(n_clusters=k)
# fit_transform returns an (n_samples, k) matrix of distances
# from each sample to each cluster centroid
X_digits_dist = kmeans.fit_transform(X_train_vectorized)

# position (within dX_train) of the single closest sample to each centroid
representative_digit_idx = np.argmin(X_digits_dist, axis=0)

# positions of the 3 closest samples to each centroid
nearest_indices = np.argsort(X_digits_dist, axis=0)[:3]
nearest_indices_flattened = nearest_indices.flatten()
X_representative_digits = X_train_vectorized[nearest_indices_flattened]
# .iloc, not .loc: argsort returns positions, not index labels
y_representative_digits = dy_train.iloc[nearest_indices_flattened]

clf = RandomForestClassifier(n_estimators=1000)
clf.fit(X_representative_digits, y_representative_digits.values.ravel())

new_scr = clf.score(X_test_vectorized, dy_test)
print(f'Accuracy with only {len(nearest_indices_flattened)} '
      f'representative training examples: {new_scr:.2%}')

I tried simply taking the first 50 labels for the first 50 vectorized samples, and that gave me around 60% accuracy. But when I tried using the clustering method to retrieve labels for training my model, model performance dropped. It's a classification dataset with balanced classes, and I'm not sure of the reason.

1 Answer


Here's a revised approach:

  1. Fit the k-means algorithm on your vectorized samples to obtain a cluster assignment for each sample.
  2. For each cluster, find the row positions of the samples that belong to it.
  3. Retrieve the corresponding rows from your separate labels dataframe. This gives you the labels for the samples in each cluster.
  4. Within each cluster, select the 5 points closest to the cluster centroid (using the distances returned by kmeans.transform, for example).
  5. Assign the labels from step 3 to those 5 nearest points in your original dataset.

By following these steps, you will be able to label the 5 nearest points from each of the 15 clusters using the indices from your separate labels dataframe; a sketch of steps 1–5 follows below.
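For concreteness, here is a minimal sketch of steps 1–5, assuming the 5000 samples are already TF-IDF vectorized as X_vec and that labels_df is the separate labels dataframe aligned row-for-row with those samples (both names are placeholders, not from your code):

import numpy as np
from sklearn.cluster import KMeans

k = 15
kmeans = KMeans(n_clusters=k)
cluster_ids = kmeans.fit_predict(X_vec)   # step 1: one cluster id per sample
dist = kmeans.transform(X_vec)            # (n_samples, k) distances to each centroid

picked_rows = []
for c in range(k):
    members = np.where(cluster_ids == c)[0]               # step 2: row positions in cluster c
    nearest5 = members[np.argsort(dist[members, c])[:5]]  # step 4: 5 closest to centroid c
    picked_rows.extend(nearest5.tolist())

# steps 3 and 5: positional lookup into the separate labels dataframe,
# which must be aligned row-for-row with X_vec
picked_labels = labels_df.iloc[picked_rows]

Because picked_rows holds positional indices into X_vec, the .iloc lookup only works if labels_df has the same row order as the vectorized samples; if the rows were shuffled (e.g., by train_test_split), realign via the dataframe's index first.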

Regarding the decrease in model performance when using the clustering method to retrieve random labels for training, there could be several factors contributing to this issue. One possibility is that the random labels might introduce noise or incorrect supervision signals, leading to degraded model performance. Additionally, it's important to ensure that the selected labeled points are representative of the overall dataset to ensure effective learning. You may consider evaluating different semi-supervised learning techniques or alternative approaches, such as active learning or self-training, to improve the model's performance.
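As an illustration of the self-training idea, scikit-learn's SelfTrainingClassifier (in sklearn.semi_supervised) treats samples labeled -1 as unlabeled and pseudo-labels them iteratively. A minimal sketch, reusing the placeholder names X_vec, labels_df, and picked_rows from above and assuming integer-encoded class labels:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

y_all = labels_df.values.ravel()             # assumes a single, integer-encoded label column
y_partial = np.full(X_vec.shape[0], -1)      # -1 marks a sample as unlabeled
y_partial[picked_rows] = y_all[picked_rows]  # keep only the 75 cluster-selected labels

self_train = SelfTrainingClassifier(
    RandomForestClassifier(n_estimators=500),  # base estimator must expose predict_proba
    threshold=0.9)                             # adopt only confident pseudo-labels
self_train.fit(X_vec, y_partial)

This lets the model learn from the thousands of unlabeled samples while staying anchored by the 75 cluster-selected labels; whether it beats plain supervised training on those 75 points depends on how well the cluster structure matches the true classes.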

  • Hi, my approach was similar to yours, just without the iteration; a clustering technique is being used to label the data. My query was whether it is the right approach to look up the index values from the vectorized format in the corresponding values of y (labels). – Shraddha S May 18 '23 at 17:33
  • This answer looks like it was generated by an AI (like ChatGPT), not by an actual human being. You should be aware that [posting AI-generated output is officially **BANNED** on Stack Overflow](https://meta.stackoverflow.com/q/421831). If this answer was indeed generated by an AI, then I strongly suggest you delete it before you get yourself into even bigger trouble: **WE TAKE PLAGIARISM SERIOUSLY HERE.** Please read: [Why posting GPT and ChatGPT generated answers is not currently allowed](https://stackoverflow.com/help/gpt-policy). – tchrist Jul 09 '23 at 15:34