I am trying to run a clustering analysis (preferably k-means) on the words of poems stored in a pandas DataFrame. As a first step, I am trying to vectorize the words with the Word2Vec model from the gensim package. However, the vectors come out as all zeros, so my code is failing to translate the words into vectors, and as a result the clustering doesn't work. Here is my code:
import gensim
import numpy as np

# create a gensim model
model = gensim.models.Word2Vec(vector_size=100)
# copy original pandas dataframe with poems
data = poems.copy(deep=True)
# get data ready for kmeans clustering
final_data = []  # empty list
for i, row in data.iterrows():
    poem_vectorized = []
    poem = row['Main_text']
    poem_all_words = poem.split(sep=" ")
    for poem_w in poem_all_words:  # iterate through list of words
        try:
            poem_vectorized.append(list(model.wv[poem_w]))
        except Exception as e:
            pass
    try:
        poem_vectorized = np.asarray(poem_vectorized)
        poem_vectorized_mean = list(np.mean(poem_vectorized, axis=0))
    except Exception as e:
        poem_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(poem_vectorized_mean)
    except:
        poem_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(poem_vectorized_mean)
    final_data.append(temp_row)
X = np.asarray(final_data)
print(X)
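To narrow down where the zeros come from, a small diagnostic along these lines might help. It is only a sketch: `embed_poem` and the toy dictionaries are hypothetical names introduced here for illustration, with a plain dict standing in for `model.wv`. It averages the vectors of the words it finds and counts the lookups that miss instead of silently swallowing them.

```python
import numpy as np


def embed_poem(text, lookup, dim=100):
    """Average the vectors of the words found in `lookup`.

    `lookup` is any mapping that supports `in` and `[]`
    (a plain dict here, as a stand-in for `model.wv`).
    Returns (mean_vector, n_found, n_missing).
    """
    words = text.split()
    found = [np.asarray(lookup[w], dtype=float) for w in words if w in lookup]
    n_missing = len(words) - len(found)
    if not found:
        # No word could be looked up: fall back to a zero vector
        return np.zeros(dim), 0, n_missing
    return np.mean(found, axis=0), len(found), n_missing


# Hypothetical toy data: one lookup that knows two words, one that knows none
toy_lookup = {"rose": np.ones(100), "thorn": np.full(100, 3.0)}
empty_lookup = {}

vec, n_found, n_missing = embed_poem("rose and thorn", toy_lookup)
print(n_found, n_missing, vec[:3])

vec0, n_found0, n_missing0 = embed_poem("rose and thorn", empty_lookup)
print(n_found0, n_missing0, vec0[:3])
```

As far as I can tell, passing `model.wv` in place of the dict should work with gensim 4.x, since its keyed vectors support both `in` and `[]`. If every word is counted as missing, the lookups rather than the averaging produce the zeros, which would be consistent with the model above being constructed without any training sentences.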
On closer inspection of:
poem_vectorized.append(list(model.wv[poem_w]))