Python beginner : Preprocessing a french text in python and calculate the polarity with a lexicon

Question

I am writing an algorithm in python which processes a column of sentences and then gives the polarity (positive or negative) of each cell of my column of sentences. The script uses a list of negative and positive word from the NRC emotion lexicon (French version) I am having a problem writing the preprocess function. I have already written the count function and the polarity function but since I have some difficulty writing the preprocess function, I am not really sure if those functions works.

The positive and negative words were in the same file (lexicon) but I export positive and negztive words separately because I did not know how to use the lexicon as it was.

My function count occurrence of positive and negative does not work and I do not know why it Always sends me 0. I Added positive word in each sentence so the should appear in the dataframe:

stacktrace :


[4 rows x 6 columns]
   id                                           Verbatim      ...       word_positive  word_negative
0  15  Je n'ai pas bien compris si c'était destiné a ...      ...                   0              0
1  44  Moi aérien affable affaire agent de conservati...      ...                   0              0
2  45  Je affectueux affirmative te hais et la Foret ...      ...                   0              0
3  47  Je absurde accidentel accusateur accuser affli...      ...                   0              0

=>  
def count_occurences_Pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]


    return len(intersection)
csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_Pos, args=(lexiconPos, ))

This my csv_data : line 44 , 45 contains positive words and line 47 more negative word but in the column of positive and negative word , it is alwaqys empty, the function does not return the number of words and the final column is Always positive whereas the last sentence is negative

id;Verbatim
15;Je n'ai pas bien compris si c'était destiné a rester
44;Moi aérien affable affaire agent de conservation qui ne agraffe connais rien, je trouve que c'est s'emmerder pour rien, il suffit de mettre une multiprise
45;Je affectueux affirmative te hais et la Foret enchantée est belle de milles faux et les jeunes filles sont assises au bor de la mer
47;Je absurde accidentel accusateur accuser affliger affreux agressif allonger allusionne admirateur admissible adolescent agent de police Comprends pas la vie et je suis perdue

Here the full code :

# -*- coding: UTF-8 -*-
import codecs 
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd
try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")

from itertools import islice
from collections import defaultdict, Counter



csv_df = pd.read_csv('test.csv', na_values=['no info', '.'], encoding='Cp1252', delimiter=';')
#print(csv_df.head())

stopWords = set(stopwords.words('french'))  
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')     
def process_text(text):
    '''extract lemma and lowerize then removing stopwords.'''

    text_preprocess =[]
    text_without_stopwords= []

    text = tagger.tag_text(text)
    for word in text:
        parts = word.split('\t')
        try:
            if parts[2] == '':
                text_preprocess.append(parts[1])
            else:
                text_preprocess.append(parts[2])
        except:
            print(parts)


    text_without_stopwords= [word.lower() for word in text_preprocess if word.isalnum() if word not in stopWords]
    return text_without_stopwords

csv_df['sentence_processing'] = csv_df['Verbatim'].apply(process_text)
#print(csv_df['word_count'].describe())
print(csv_df)


lexiconpos = open('positive.txt', 'r', encoding='utf-8')
print(lexiconpos.read())
def count_occurences_pos(text, word_list):
    '''Count occurences of words from a list in a text string.'''

    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]

    return len(intersection)


#csv_df['word_positive'] = csv_df['Verbatim'].apply(count_occurences_pos, args=(lexiconpos, ))
#print(csv_df)

lexiconneg = open('negative.txt', 'r', encoding='utf-8')

def count_occurences_neg(text, word_list):
    '''Count occurences of words from a list in a text string.'''
    text_list = process_text(text)

    intersection = [w for w in text_list if w in word_list]

    return len(intersection)
#csv_df['word_negative'] = csv_df['Verbatim'].apply(count_occurences_neg, args= (lexiconneg, ))
#print(csv_df)

def polarity_score(text):   
    ''' give the polarity of each text based on the number of positive and negative word '''
    positives_text =count_occurences_pos(text, lexiconpos)
    negatives_text =count_occurences_neg(text, lexiconneg)
    if positives_text > negatives_text :
        return "positive"
    else : 
        return "negative"
csv_df['polarity'] = csv_df['Verbatim'].apply(polarity_score)
#print(csv_df)
print(csv_df)

If you could also see if the rest of the code is good to thank you.

Espoir Murhabazi · Accepted Answer · 2019-05-22T13:08:06.883

1

I have found your error! It comes from the Polarity_score function.

It's just a typo : In your, if statement you were comparing count_occurences_Pos and count_occurences_Neg which are function instead of comparing the results of the function count_occurences_pos and count_occurences_peg

Your code should be like this :

def Polarity_score(text):
    ''' give the polarity of each text based on the number of positive and negative word '''
    count_text_pos =count_occurences_Pos(text, word_list)
    count_text_neg =count_occurences_Neg(text, word_list)
    if count_occurences_pos > count_occurences_peg :
        return "Positive"
    else : 
        return "negative"

In the future, you need to learn how to have meaningful names for your variables to avoid those kinds of errors With correct variables names, your function should be :

 def polarity_score(text):
        ''' give the polarity of each text based on the number of positive and negative word '''
        positives_text =count_occurences_pos(text, word_list)
        negatives_text =count_occurences_neg(text, word_list)
        if positives_text > negatives_text :
            return "Positive"
        else : 
            return "negative"

Another improvement you can make in your count_occurences_pos and count_occurences_neg function is to use set instead of the list. Your text and world_list can be converted to sets and you can use the set intersection to retrieve the positive texts in them.Because set are faster than lists

edited May 22 '19 at 13:08

answered May 22 '19 at 10:22

Espoir Murhabazi

5,973
5
42
73

please thank you but coul you look at my count_occurence function, it is not working – kely789456123 May 22 '19 at 11:49
1

Yes, I looked at it I didn't see any problem. I have found some improvement in it. – Espoir Murhabazi May 22 '19 at 13:01
i update my csv file, it contains positive words and line 47 more negative word but in the column of positive and negative word , it is alwaqys empty, the function does not return the number of words and the final column is Always positive whereas the last sentence is negative – kely789456123 May 22 '19 at 13:07
I have this error : File "pandas/_libs/src\inference.pyx", line 1472, in pandas._libs.lib.map_infer File "Feel.py", line 92, in Polarity_score count_text_pos =count_occurences_Pos(text, word_list) NameError: name 'word_list' is not defined – kely789456123 May 22 '19 at 13:27
yeah i know i changr but i still having the same issue , the total Numbers of positive and negative words for each sentence in a cell does not appear in the dataframe which lead to every cell being positive whereas is not true. and that's the proble i am trying to look, why the len of the intercession does not appear – kely789456123 May 22 '19 at 13:51
1

thank you i find the solution, I sould have transform the file in a list. Now it is working – kely789456123 May 22 '19 at 14:34

Python beginner : Preprocessing a french text in python and calculate the polarity with a lexicon

1 Answers1