
I have a csv file named movie_reviews.csv and the data inside looks like this:

1 Pixar classic is one of the best kids' movies of all time.
1 Apesar de representar um imenso avanço tecnológico, a força
1 It doesn't enhance the experience, because the film's timeless appeal is down to great characters and wonderful storytelling; a classic that doesn't need goggles or gimmicks.
1 As such Toy Story in 3D is never overwhelming. Nor is it tedious, as many recent 3D vehicles have come too close for comfort to.
1 The fresh look serves the story and is never allowed to overwhelm it, leaving a beautifully judged yarn to unwind and enchant a new intake of young cinemagoers.
1 There's no denying 3D adds extra texture to Pixar's seminal 1995 buddy movie, emphasising Buzz and Woody's toy's-eye-view of the world.
1 If anything, it feels even fresher, funnier and more thrilling in today's landscape of over-studied demographically correct moviemaking.
1 If you haven't seen it for a while, you may have forgotten just how fantastic the snappy dialogue, visual gags and genuinely heartfelt story is.
0 The humans are wooden, the computer-animals have that floating, jerky gait of animated fauna.
1 Some thrills, but may be too much for little ones.
1 Like the rest of Johnston's oeuvre, Jumanji puts vivid characters through paces that will quicken any child's pulse.
1 "This smart, scary film, is still a favorite to dust off and take from the ""vhs"" bin"
0 All the effects in the world can't disguise the thin plot.

The first column with the 0s and 1s is my label.

I want to first turn the texts in movie_reviews.csv into vectors, then split my dataset based on the labels (all rows labelled 1 for training and all rows labelled 0 for testing), and then feed the vectors into a classifier such as a random forest.


1 Answer


For such a task you'll need to parse your data first with different tools. First, lower-case all your sentences. Then remove all stopwords (the, and, or, ...). Tokenize (an introduction here: https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3). You can also use stemming to keep only the root of each word, which can be helpful for sentiment classification.
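As a rough sketch, that cleaning could look like this with NLTK (the PorterStemmer, word_tokenize and the English stopword list are standard NLTK components; the nltk.download calls just fetch the needed resources):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt")      # tokenizer models
    nltk.download("stopwords")  # stopword lists

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(sentence):
        # Lower-case, tokenize, drop stopwords and punctuation, then stem.
        tokens = word_tokenize(sentence.lower())
        return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

    print(preprocess("Pixar classic is one of the best kids' movies of all time."))
    # roughly: ['pixar', 'classic', 'one', 'best', 'kid', 'movi', 'time']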

Then you'll assign an index to each word of your vocabulary and replace the words in your sentences with these indexes (a small code sketch follows the example below):

Imagine your vocabulary is: ['i', 'love', 'keras', 'pytorch', 'tensorflow']

  • index['None'] = 0 #in case a new word is not in your vocabulary

  • index['i'] = 1

  • index['love'] = 2

  • ...

Thus the sentence 'I love Keras' will be encoded as [1 2 3].
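A minimal sketch of that mapping in Python, using the toy vocabulary above and reserving index 0 for unknown words:

    vocabulary = ['i', 'love', 'keras', 'pytorch', 'tensorflow']

    # 0 is reserved for words that are not in the vocabulary.
    index = {word: i + 1 for i, word in enumerate(vocabulary)}

    def encode(tokens):
        return [index.get(token, 0) for token in tokens]

    print(encode(['i', 'love', 'keras']))  # [1, 2, 3]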

However, you have to define a maximum length max_len for your sentences, and when a sentence contains fewer words than max_len you pad its vector with zeros up to length max_len.

In the previous example, if max_len = 5, then [1 2 3] -> [1 2 3 0 0].
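Continuing the sketch, the padding (plus truncation of sentences longer than max_len) could look like this:

    def pad(encoded, max_len=5):
        # Cut anything past max_len, then fill with zeros up to max_len.
        return encoded[:max_len] + [0] * (max_len - len(encoded))

    print(pad([1, 2, 3]))  # [1, 2, 3, 0, 0]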

This is a basic approach. Feel free to check the preprocessing tools provided by libraries such as NLTK and pandas.
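Since the question asks for a random forest, here is a hedged end-to-end sketch using scikit-learn (not covered above): CountVectorizer builds a bag-of-words representation instead of the padded index sequences described here, and the split by label follows the question. Note that training only on the rows labelled 1 gives the forest a single class to learn, so it can only ever predict that class.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import CountVectorizer

    # Assumed layout: label in the first column, review text in the second.
    df = pd.read_csv("movie_reviews.csv", header=None, names=["label", "text"])

    # Split by label as described in the question: 1s for training, 0s for testing.
    train = df[df["label"] == 1]
    test = df[df["label"] == 0]

    # Bag-of-words vectors with built-in lower-casing and English stopword removal.
    vectorizer = CountVectorizer(lowercase=True, stop_words="english")
    X_train = vectorizer.fit_transform(train["text"])
    X_test = vectorizer.transform(test["text"])

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, train["label"])
    print(clf.predict(X_test))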