Im learning how to convert text into numbers for NLP problems and following a course Im learning about word vectors provided by Spacy package. the code works all fine from learning and evaluation but I have some problems regarding:
making prediction for new sentences, I cannot seems to make it work and most examples just fit the model then use X_test set for evaluation. ( Code below)
The person explaining stated that its bad( won't give good results) if I used
"" doc.vector over doc.vector.values
""
when trying both I don't see a difference, what is the difference between the two?
the example is to classify news title between fake and real
import spacy
import pandas as pd
df= pd.read_csv('Fake_Real_Data.csv')
print(df.head())
print(f"shape is: {df.shape}")
print("checking the impalance: \n ", df.label.value_counts())
df['label_No'] = df['label'].map({'Fake': 0, 'Real': 1})
print(df.head())
nlp= spacy.load('en_core_web_lg') # only large and medium model have word vectors
df['Text_vector'] = df['Text'].apply(lambda x: nlp(x).vector) #apply the function to EACH element in the column
print(df.head(5))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test= train_test_split(df.Text_vector.values, df.label_No, test_size=0.2, random_state=2022)
x_train_2D= np.stack(X_train)
x_test_2D= np.stack(X_test)
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB()
from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
scaled_train_2d= scaler.fit_transform(x_train_2D)
scaled_test_2d= scaler.transform(x_test_2D)
clf.fit(scaled_train_2d, y_train)
from sklearn.metrics import classification_report
y_pred=clf.predict(scaled_test_2d)
print(classification_report(y_test, y_pred))