
I'm learning how to convert text into numbers for NLP problems. Following a course, I'm learning about the word vectors provided by the spaCy package. The code works fine for training and evaluation, but I have two problems:

  1. Making predictions for new sentences. I cannot seem to make it work, and most examples just fit the model and then use the X_test set for evaluation (code below).

  2. The person explaining stated that it's bad (won't give good results) to use `doc.vector` rather than `doc.vector.values`. When trying both I don't see a difference. What is the difference between the two?

The example classifies news titles as fake or real.

import numpy as np  # needed for np.stack below
import spacy
import pandas as pd

df= pd.read_csv('Fake_Real_Data.csv')

print(df.head())
print(f"shape is: {df.shape}")


print("checking the imbalance: \n ", df.label.value_counts())



df['label_No'] = df['label'].map({'Fake': 0, 'Real': 1})
print(df.head())




nlp= spacy.load('en_core_web_lg') # only large and medium model have word vectors


df['Text_vector'] = df['Text'].apply(lambda x: nlp(x).vector) #apply the function to EACH element in the column
print(df.head(5))



from sklearn.model_selection import train_test_split

X_train, X_test, y_train,y_test= train_test_split(df.Text_vector.values, df.label_No, test_size=0.2, random_state=2022)




x_train_2D= np.stack(X_train)
x_test_2D= np.stack(X_test)



from sklearn.naive_bayes import MultinomialNB

clf=MultinomialNB()

from sklearn.preprocessing import MinMaxScaler

scaler= MinMaxScaler()

scaled_train_2d= scaler.fit_transform(x_train_2D)
scaled_test_2d= scaler.transform(x_test_2D) 

clf.fit(scaled_train_2d, y_train)

from sklearn.metrics import classification_report

y_pred=clf.predict(scaled_test_2d)

print(classification_report(y_test, y_pred))
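
One possible way to extend the code above to new sentences is sketched below. It is only a sketch under assumptions: `predict_texts` is a hypothetical helper name, and `nlp`, `scaler`, and `clf` are assumed to be the already loaded spaCy pipeline and the already fitted `MinMaxScaler` and `MultinomialNB` from the code above. The key point is that new text goes through the same steps as the test set: vectorize, stack to 2D, `transform` (not `fit_transform`), then `predict`.

```python
import numpy as np


def predict_texts(texts, nlp, scaler, clf):
    """Classify raw strings with an already fitted scaler and classifier.

    Hypothetical helper: nlp, scaler and clf are assumed to be the objects
    created in the training code (spaCy pipeline, fitted scaler, fitted model).
    """
    # Same vectorisation as in training: one document vector per text,
    # stacked into a 2-D array of shape (n_texts, n_dimensions).
    vectors = np.stack([nlp(text).vector for text in texts])
    # transform only: the scaler keeps the min/max learned from the training set.
    scaled = scaler.transform(vectors)
    return clf.predict(scaled)


# Usage with the objects from the training code (not run here):
# labels = predict_texts(["Some news headline to check"], nlp, scaler, clf)
# print(labels)  # 0 = Fake, 1 = Real, per the label_No mapping
```

Note that the function takes a list of strings, so a single sentence has to be wrapped in a list; passing a bare string would iterate over its characters.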




Mira
  • Any advice is appreciated. Most examples stop at the evaluation step; none provide code for making predictions. – Mira Dec 25 '22 at 10:21
  • This is not exactly a programming question, but rather a question about understanding the design and implementation of this kind of ML setup. Therefore I would recommend asking on https://datascience.stackexchange.com/, which is a more appropriate forum for this kind of question. The general advice I always give is to implement a separate function for applying the model to any set of instances: this forces a clear separation from the training step (in particular, use `transform` and not `fit_transform`), and it makes applying the model to new instances as straightforward as evaluating on the test set. – Erwan Dec 25 '22 at 11:01
  • It looks like you're getting vectors from spaCy and using scikit learn to fit a model. You can do that, but spaCy has its own models you can use. Also, since you have two questions, you should split them up, and it's not clear what you're asking in 2. – polm23 Dec 26 '22 at 02:56
  • That's new to me; most examples provided in courses use scikit-learn classifiers, and I never knew that spaCy has its own. I will look into it. As for my second question: the person explaining the example said that when splitting the data into features (X) and targets, it is better to use `df.text_vector.values` than `df.text_vector` alone, but never said why. – Mira Dec 26 '22 at 18:16
  • @Erwan tried it, did not work – Mira Dec 28 '22 at 07:58
  • @Mira See a [similar example with this logic](https://stackoverflow.com/a/74354303/891919). It's not easy to guess what is the problem with your code since we don't have your data. If you could give a [reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) I would take a look. – Erwan Dec 28 '22 at 10:53
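
On the `.values` question, a small sketch may help show what actually changes. It uses a hypothetical three-row stand-in for the `Text_vector` column, not the real data. The column itself is a pandas Series that carries the row index, while `.values` is a plain NumPy object array without it. Stacking gives the same matrix either way, which would be consistent with seeing no difference in results; the two only diverge when rows are accessed by integer after a reordering such as a train/test split. Whether this is the reason the course had in mind is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['Text_vector']: each cell holds one 1-D vector.
df = pd.DataFrame({
    "Text_vector": [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])],
    "label_No": [0, 1, 0],
})

as_series = df.Text_vector         # pandas Series, keeps the row index
as_array = df.Text_vector.values   # NumPy object array, index dropped

# Both stack into the same 2-D feature matrix.
from_series = np.stack(as_series)
from_array = np.stack(as_array)
same = np.array_equal(from_series, from_array)

# After a reordering (as a shuffled split produces), integer access differs:
reordered = as_series.iloc[[2, 0, 1]]
by_label = reordered[0]            # Series: looks up the row labelled 0
by_position = reordered.values[0]  # array: first row in the new order

print(type(as_series).__name__, type(as_array).__name__, same)
print(by_label, by_position)
```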
