I am attempting to build a naive Bayes model for text classification.

Here is a sample of the data I'm working with:

df_some_observations = filtered_training.sample(frac=0.0001)
df_some_observations.to_dict()

The output looks like this:

{'Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_x': {40219: 'aegua00268 format oper scad htbhta fonction avance',
  16820: 'aeedf50490 sort conflit facon construct',
  24771: '4022mps192 prepar a lhabilit electr boho indic v personnel non elec',
  34482: '3095mceg73 affirmezvous relat professionnel bas ref 7114'},
 'Nœud parent au niveau N y compris moi-même.1': {40219: 'distribu electricit rel reseau electricit ecr exploit conduit reseau electricit',
  16820: 'ct competent transvers rhu ressourc humain for pilotag gestion format',
  24771: 'ss sant securit prevent prf prevent risqu professionnel hcp habilit certif perm prevent risqu meti',
  34482: 'nan'},
 'Thème de formation (Chemin complet)': {40219: 'distribu electricit rel reseau electricit ecr exploit conduit reseau electricit',
  16820: 'ct competent transvers rhu ressourc humain for pilotag gestion format',
  24771: 'ss sant securit prevent prf prevent risqu professionnel hcp habilit certif perm prevent risqu meti',
  34482: 'in ingenier esp equip sous pression'},
 'Description du champ supplémentaire : Objectifs de la formation': {40219: 'nan',
  16820: 'nan',
  24771: 'prepar a lhabilit electr boho indic v autoris special lissu cet format stagiair doit connaitr risqu electr savoir sen proteg doit etre capabl deffectu oper simpl dexploit suiv certain methodolog',
  34482: 'nan'},
 'Objectifs': {40219: 'nan', 16820: 'nan', 24771: 'nan', 34482: 'nan'},
 'Programme de formation': {40219: 'nan',
  16820: 'nan',
  24771: 'notion elementair delectricit sensibilis risqu electr prevent risqu electr publiqu utec 18 510 definit oper lenviron intervent tbt b appareillag electr bt materiel protect individuel collect manoeuvr mesurag essais verif outillag electr portat a main mis situat coffret didact',
  34482: 'nan'},
 'Populations concernées': {40219: 'nan',
  16820: 'nan',
  24771: 'personnel electricien effectu oper dordr electr',
  34482: 'nan'},
 'Prérequis': {40219: 'nan',
  16820: 'nan',
  24771: 'personnel non electricien effectu oper simpl remplac fusibl rearm disjoncteur rel thermiqu',
  34482: 'nan'},
 "Description du champ supplémentaire : Commanditaire de l'action": {40219: 'nan',
  16820: 'nan',
  24771: 'nan',
  34482: 'nan'},
 "Organisme dispensant l'action": {40219: 'local sei',
  16820: 'intern edf',
  24771: 'intern edf',
  34482: 'intern edf'},
 'Durée théorique (h)': {40219: 14.0, 24771: 11.0, 34482: 14.0},
 'Coût de la catégorie Coût pédagogique': {40219: 0.0,
  16820: 0.0,
  24771: 0.0,
  34482: 0.0},
 'Coût de la catégorie Coût logistique': {40219: 0.0,
  16820: 0.0,
  24771: 0.0,
  34482: 0.0},

I started by splitting the data after removing some unnecessary columns:

(my target variable is in column 15)

df_training = filtered_training.sample(frac=0.8, random_state=42) 
df_test = filtered_training.drop(df_training.index)
X_train = df_training.iloc[:,:14]
y_train = df_training.iloc[:,15]
X_test = df_test.iloc[:,:14]
y_test = df_test.iloc[:,15]

When building the model with:

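from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
predicted_categories = model.predict(X_test)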

I receive the following error when executing model.fit(X_train, y_train):

ValueError: Found input variables with inconsistent numbers of samples: [14, 35478]

Additional information that may be helpful:

np.shape(X_train) #(35478, 14)
np.shape(y_train) #(35478,)
np.shape(X_test) #(8870, 14)
np.shape(y_test) #(8870,)
Wajih101
  • Looks like you've got multiple text columns in your dataset. What exactly is the predictor/independent variable in this case? – A.T.B Jun 16 '23 at 19:04
  • @A.T.B If you are asking about the nature of the predictors they are string and numeric types. – Wajih101 Jun 18 '23 at 10:23
  • And do you require all these predictors in order to train the model? Because if you have numeric data as well, this is no longer strictly text classification – A.T.B Jun 18 '23 at 15:47
  • @A.T.B No, I can get rid of all of the numeric data. – Wajih101 Jun 18 '23 at 21:07
  • Can you add df_some_observations.head()? Instead of the dict output. Also, it would be helpful if you could upload a link to the dataset. – tonygrey Jun 19 '23 at 08:46
  • I think the operation df_training.iloc[:,:14] is the problem: you need df_training.iloc[:,:14].values instead, and likewise in all the other places. TfidfVectorizer expects a list of texts as input, so you need to convert the input into a list of texts. – tonygrey Jun 19 '23 at 08:54
  • @tonygrey I modified the code to only take the string columns into account: `X_train = df_training.iloc[:, :14].select_dtypes(include='object')` and `y_train = df_training.iloc[:, 15].astype(str)`, but I still get the same error. – Wajih101 Jun 19 '23 at 11:27
  • @Wajih101 Have you joined all the columns into a single string? otherwise, each sample will be a list of text, and it will have different dimensions. – tonygrey Jun 19 '23 at 13:10
  • @tonygrey I used df_training.iloc[:,:14].values and now I get this error: `AttributeError: 'numpy.ndarray' object has no attribute 'lower'` – Wajih101 Jun 19 '23 at 13:12
  • @tonygrey Do you mean I should do this before training the model: `X_train_strings = [' '.join(row) for row in X_train]` and `X_test_strings = [' '.join(row) for row in X_test]`? – Wajih101 Jun 19 '23 at 13:21

1 Answer

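I think the main problem is that TfidfVectorizer only works with one-dimensional text data (as far as I can see): an iterable of strings, one string per document. When you pass it a DataFrame with several text columns, it iterates over the DataFrame, and iterating a DataFrame yields its column names rather than its rows. That is why the error reports 14 samples (one per column) against 35478 labels.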
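
The behaviour described above can be reproduced on a small scale. The sketch below is only an illustration: the `toy` DataFrame, its column names, and the labels are made up, and it assumes pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 3 rows, 2 text columns.
toy = pd.DataFrame({
    "title": ["format oper scad", "sort conflit facon", "prepar habilit electr"],
    "theme": ["reseau electricit", "ressourc humain", "prevent risqu"],
})
labels = ["a", "b", "a"]

# Iterating a DataFrame yields its column labels, not its rows.
seen_by_vectorizer = list(toy)

# So the vectorizer treats each column name as one "document".
matrix = TfidfVectorizer().fit_transform(toy)
n_documents = matrix.shape[0]

# The classifier then receives 2 feature rows for 3 labels and refuses to fit.
try:
    make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(toy, labels)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(seen_by_vectorizer, n_documents)
print(error_message)
```

With 14 columns and 35478 rows, the same mechanism would explain the `[14, 35478]` in the original error.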

I see two ways to solve this in your case:

  1. If you want to apply TfidfVectorizer to each column individually, you could do it like this, for example:
from sklearn.compose import ColumnTransformer

# One vectorizer (and one vocabulary) per column; make sure that all columns contain text data
column_transformer = ColumnTransformer([(x, TfidfVectorizer(), x) for x in X_train.columns])
model = make_pipeline(column_transformer, MultinomialNB())
model.fit(X_train, y_train)
predicted_categories = model.predict(X_test)
  2. But if you want a single vocabulary shared by all your columns, then I would recommend concatenating them into one text column first:
# Join the text columns row by row, separated by spaces
new_X_train = X_train.iloc[:,0]
for x in X_train.columns[1:]:
    new_X_train = new_X_train + ' ' + X_train[x]

new_X_test = X_test.iloc[:,0]
for x in X_test.columns[1:]:
    new_X_test = new_X_test + ' ' + X_test[x]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(new_X_train, y_train)
predicted_categories = model.predict(new_X_test)
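
Both options can be tried end to end on toy data. The following is a minimal sketch, not a drop-in solution: the DataFrame, the column names in `text_columns`, and the labels are hypothetical. It also assumes that numeric columns are left out (as discussed in the comments) and that missing values are replaced with empty strings before vectorizing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two text columns, one numeric column, one label.
toy = pd.DataFrame({
    "title": ["format reseau electricit", "gestion ressourc humain",
              "conduit reseau electricit", "pilotag gestion format"],
    "theme": ["exploit reseau", None, "exploit conduit", "ressourc humain"],
    "hours": [14.0, 11.0, 14.0, 7.0],
    "label": ["elec", "rh", "elec", "rh"],
})
y = toy["label"]

# Assumption: the text columns are known by name; the numeric one is dropped.
text_columns = ["title", "theme"]
X_text = toy[text_columns].fillna("")

# Option 1: one TfidfVectorizer (one vocabulary) per column.
per_column = make_pipeline(
    ColumnTransformer([(col, TfidfVectorizer(), col) for col in text_columns]),
    MultinomialNB(),
)
per_column.fit(X_text, y)

# Option 2: join the columns row by row, then use a single shared vocabulary.
joined = X_text.apply(" ".join, axis=1)
shared = make_pipeline(TfidfVectorizer(), MultinomialNB())
shared.fit(joined, y)

print(per_column.predict(X_text))
print(shared.predict(joined))
```

Note that the row-wise join here uses `axis=1`; a plain `for row in X_train` loop over a DataFrame would iterate over column names again, which is the original problem.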
MaryRa