
I have created a dataset of movies produced in the past few years. It records the technicians who worked on each film, its genre, the country it represented, its runtime, its language, the film festivals at which it has won, etc.

The dataset is similar to this; it is an Excel file.

I'm interested in multi-label classification of the movies to film festivals, based on the inherent features of each movie (irrespective of the plot).

I assume the data has to be converted to numbers/vectors before it can be used for multi-label classification. However, I don't know how to vectorize names (proper nouns) and fields that contain only a few individual words.

Is there any other way I can carry out the process to achieve my goal of multi-label classification with the above data? Please help me identify it. Thank you.

Suchit Y

1 Answer


The dataset you have here is tabular data. You need to vectorise that tabular data in order to be able to pass it to a classification model.

Tabular data usually consists of:

  1. continuous features (e.g. IMDb rating, runtime)
  2. categorical features (e.g. every other feature in your dataset)

The vectorisation of tabular data is simply the concatenation of the vector representation of each feature. For continuous features, you should normalise the values. For categorical features you should one-hot encode them.
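
To make the concatenation concrete, here is a minimal standard-library sketch. The rows, the column names (`runtime`, `genre`, `language`) and the helper functions `fit` and `transform` are made up for illustration and are not taken from your Excel file; z-score scaling is assumed as the normalisation, although other choices (such as min-max scaling) would work too.

```python
from statistics import mean, pstdev

# Hypothetical rows; column names and values are made up for illustration.
films = [
    {"runtime": 90.0, "genre": "drama", "language": "Spanish"},
    {"runtime": 120.0, "genre": "comedy", "language": "French"},
    {"runtime": 150.0, "genre": "drama", "language": "Spanish"},
]

CONTINUOUS = ["runtime"]
CATEGORICAL = ["genre", "language"]


def fit(rows):
    """Learn the normalisation statistics and the category vocabularies."""
    stats = {}
    for col in CONTINUOUS:
        values = [r[col] for r in rows]
        # Fall back to 1.0 so a constant column does not divide by zero.
        stats[col] = (mean(values), pstdev(values) or 1.0)
    vocab = {col: sorted({r[col] for r in rows}) for col in CATEGORICAL}
    return stats, vocab


def transform(row, stats, vocab):
    """Concatenate the per-feature vectors into one flat vector."""
    vector = []
    for col in CONTINUOUS:
        mu, sigma = stats[col]
        vector.append((row[col] - mu) / sigma)  # z-score normalisation
    for col in CATEGORICAL:
        # One-hot block: a category never seen in fit() yields all zeros.
        vector.extend(1.0 if row[col] == cat else 0.0 for cat in vocab[col])
    return vector


stats, vocab = fit(films)
X = [transform(r, stats, vocab) for r in films]
print(X[1])  # [0.0, 1.0, 0.0, 1.0, 0.0]
```

Each row becomes one number for `runtime` followed by one slot per genre and one slot per language. In practice you would probably not write this by hand: scikit-learn's `StandardScaler`, `OneHotEncoder` and `ColumnTransformer` cover the same steps.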

Note: your dataset has three "text-like" features (title, director and writer):

  • title: a title is unique to its film, so there is nothing your model can learn from it; you should discard it from the dataset.
  • director and writer: treat them as categorical variables, not as text. Encoding them with text vectorisation techniques (bag of words or TF-IDF) would assume that a word like Pedro has predictive power. Do Pedro Gonzalez-Rubio and Pedro Almodovar have anything in common? If they do, it is perhaps that they both speak Spanish, but in that case I would rather add that as an explicit feature to your model (e.g. language_of_director).
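
The difference between the two encodings can be seen in a small sketch. The helper names `word_features` and `categorical_feature` are hypothetical, and splitting on whitespace is a simplified stand-in for what a bag-of-words tokeniser would do.

```python
directors = ["Pedro Gonzalez-Rubio", "Pedro Almodovar"]


def word_features(name):
    """Bag-of-words view: every whitespace-separated token is a feature."""
    return set(name.lower().split())


def categorical_feature(name):
    """Categorical view: the whole name is a single category."""
    return {name}


# Features the two directors would share under each encoding.
shared_words = word_features(directors[0]) & word_features(directors[1])
shared_categories = categorical_feature(directors[0]) & categorical_feature(directors[1])

print(shared_words)       # {'pedro'}
print(shared_categories)  # set()
```

With the word-level encoding the model sees a common "pedro" feature and may treat the two directors as related; with the categorical encoding they are simply two distinct categories.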
louis_guitton