
I am doing some natural language processing on Twitter data. I managed to load and clean up some tweets and place them into the data frame below.

id                    text                                                                          
1104159474368024599 repmiketurner the only time that michael cohen told the truth is when he pled that he is guilty also when he said no collusion and i did not tell him to lie
1104155456019357703 rt msnbc president trump and first lady melania trump view memorial crosses for the 23 people killed in the alabama tornadoes t

The problem is that I am trying to construct a term frequency matrix where each row is a tweet and each column holds the number of times a given word occurs in that tweet. My only problem is that the other posts I've found only mention term frequency distributions over text files. Here is the code I used to generate the data frame above:

import pandas as pd
import nltk.classify
from nltk.tokenize import word_tokenize
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
df_tweetText = df_tweet
#Make a dataframe of just the text and ID to make it easier to tokenize
df_tweetText = pd.DataFrame(df_tweetText['text'].str.replace(r'[^\w\s]+', '').str.lower())

#Removing Stop words
#nltk.download('stopwords')
stop = stopwords.words('english')
#df_tweetText['text'] = df_tweetText.apply(lambda x: [item for item in x if item not in stop])
#Remove the https links
df_tweetText['text'] = df_tweetText['text'].replace("[https]+[a-zA-Z0-9]{14}",'',regex=True, inplace=False)
#Tokenize the words
df_tweetText
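
For the tokenization and stop-word step that the comments above refer to, the commented-out filter only makes sense once each row is a list of tokens rather than one string. A minimal sketch of that order (the separate 'tokens' column name is just my own choice; nltk.download('punkt') may be needed for word_tokenize):

#Tokenize each tweet into a list of words, then drop stop words per row
df_tweetText['tokens'] = df_tweetText['text'].apply(word_tokenize)
df_tweetText['tokens'] = df_tweetText['tokens'].apply(lambda toks: [t for t in toks if t not in stop])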

At first I tried word_dist = nltk.FreqDist(df_tweetText['text']), but it ended up counting each entire sentence as a single value instead of counting each word in the row.

Another thing I tried was to tokenize each row using df_tweetText['text'] = df_tweetText['text'].apply(word_tokenize) and then call FreqDist again, but that gives me an error saying unhashable type: 'list'.

1104159474368024599 [repmiketurner, the, only, time, that, michael, cohen, told, the, truth, is, when, he, pled, that, he, is, guilty, also, when, he, said, no, collusion, and, i, did, not, tell, him, to, lie]
1104155456019357703 [rt, msnbc, president, trump, and, first, lady, melania, trump, view, memorial, crosses, for, the, 23, people, killed, in, the, alabama, tornadoes, t]
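
For reference, once the rows are token lists like this, FreqDist can be applied per row with .apply, which avoids the unhashable type: 'list' error (a rough sketch; it still gives one counter per tweet rather than the matrix I'm after):

from nltk.probability import FreqDist
#One FreqDist per tweet: each entry maps word -> count for that row only
word_counts = df_tweetText['text'].apply(FreqDist)
print(word_counts.iloc[0])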

Is there some alternative way to construct this term frequency matrix? Ideally, I want my data to look something like this:

id                  |collusion | president |
------------------------------------------ 
1104159474368024599 |  1       |     0     |
1104155456019357703 |  0       |     2     |

EDIT 1: So I decided to take a look at the textmining library and recreated one of their examples. The only problem is that it creates the term-document matrix with a single row containing every tweet.

import textmining
#Creates Term Matrix 
tweetDocumentmatrix = textmining.TermDocumentMatrix()
for column in df_tweetText:
    tweetDocumentmatrix.add_doc(df_tweetText['text'].to_string(index=False))
#    print(df_tweetText['text'].to_string(index=False))

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)
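
For comparison, the single-row behaviour comes from passing the whole column as one string via to_string; adding each tweet as its own document (a sketch using the same textmining calls, assuming the text column still holds the cleaned strings rather than token lists) gives one row per tweet:

import textmining
tweetDocumentmatrix = textmining.TermDocumentMatrix()
#One document per tweet instead of one string for the whole column
for text in df_tweetText['text']:
    tweetDocumentmatrix.add_doc(text)

for row in tweetDocumentmatrix.rows(cutoff=1):
    print(row)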

EDIT 2: So I tried scikit-learn and that sort of worked, but the problem is that I'm finding Chinese/Japanese characters in my columns, which should not exist. Also, my columns are showing up as numbers for some reason.

from sklearn.feature_extraction.text import CountVectorizer

corpus = df_tweetText['text'].tolist()
vec = CountVectorizer()
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)

      00  007cigarjoe  08  10  100  1000  10000  100000  1000000  10000000  \
0      0            0   0   0    0     0      0       0        0         0   
1      0            0   0   0    0     0      0       0        0         0   
2      0            0   0   0    0     0      0       0        0         0  
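
As for the numeric column names like 00 and 10000: they come from CountVectorizer's default token pattern, which keeps any run of two or more word characters, digits and CJK characters included. A sketch that restricts the vocabulary to alphabetic tokens (this particular token_pattern is just one possible choice):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = df_tweetText['text'].tolist()
#Keep only tokens made of two or more ASCII letters; drops numbers and CJK runs
vec = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
X = vec.fit_transform(corpus)
df_counts = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df_counts.head())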

1 Answer


Probably not optimal, since it iterates over each row, but it works. Mileage may vary based on how long the tweets are and how many tweets are being processed.

import pandas as pd
from collections import Counter

# example df
df = pd.DataFrame()
df['tweets'] = [['test','xd'],['hehe','xd'],['sam','xd','xd']]

# result dataframe: one appended row of word counts per tweet
df2 = pd.DataFrame()
for i, row in df.iterrows():
    # Counter gives {word: count} for this tweet; transposing makes it a single row
    df2 = df2.append(pd.DataFrame.from_dict(Counter(row.tweets), orient='index').transpose())
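
For a large number of tweets, where the per-row append gets slow, a vectorized sketch along the same lines (assuming a pandas version with Series.explode) is to explode the token lists and cross-tabulate; missing words then come out as 0 instead of NaN:

import pandas as pd

df = pd.DataFrame()
df['tweets'] = [['test','xd'],['hehe','xd'],['sam','xd','xd']]

#One row per (tweet, token) pair, then count tokens per original row index
exploded = df['tweets'].explode()
df2 = pd.crosstab(exploded.index, exploded)
print(df2)  # columns: hehe, sam, test, xd; the third row has xd == 2
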
Tyler K
  • I just gave this piece of code a shot. It is pretty slow, since I have roughly 13,000 tweets to deal with. I've tried other libraries like textmining; the problem is that they concatenate all my words and values into one row. – greatFritz Mar 13 '19 at 01:17