
I have rows of blurbs (in text format) and I want to use tf-idf to define the weight of each word. Below is the code:

import string
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

df["punc_blurb"] = df["blurb"].apply(remove_punctuations)
df = pd.DataFrame(df["punc_blurb"])

vectoriser = TfidfVectorizer()
X = vectoriser.fit_transform(df["punc_blurb"])
df["blurb_Vect"] = list(X.toarray())

df_vectoriser = pd.DataFrame(X.toarray(),
                             columns=vectoriser.get_feature_names())
print(df_vectoriser)

All I get is a massive list of numbers, and I am no longer sure whether it is giving me TF or TF-IDF, since the frequent words (the, and, etc.) all have a score above 0.

The goal is to see the weights in the tf-idf column shown below, and I am unsure whether I am doing this in the most efficient way:

Goal Output table

1 Answer


You don't need a punctuation remover if you use TfidfVectorizer: it takes care of punctuation automatically, by virtue of the default token_pattern param:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"blurb":["this is a sentence", "this is, well, another one"]})
vectorizer = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w+\\b')
df["tf_idf"] = list(vectorizer.fit_transform(df["blurb"].values.astype("U")).toarray())
vocab = sorted(vectorizer.vocabulary_.keys())
df["tf_idf_dic"] = df["tf_idf"].apply(lambda x: {k:v for k,v in dict(zip(vocab,x)).items() if v!=0})
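If the blurb column contains missing values, fit_transform will fail with the ValueError: np.nan is an invalid document mentioned in the comments below. A minimal sketch (the toy df here is an assumption for illustration) that drops empty rows before vectorising:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data with one missing blurb
df = pd.DataFrame({"blurb": ["this is a sentence", None, "this is, well, another one"]})

# Drop rows whose blurb is missing, so fit_transform only sees real strings
df = df.dropna(subset=["blurb"]).reset_index(drop=True)

vectorizer = TfidfVectorizer()
df["tf_idf"] = list(vectorizer.fit_transform(df["blurb"]).toarray())
print(df["tf_idf"])
```

Alternatively, `df["blurb"].fillna("")` keeps the rows but vectorises them as empty documents.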
Sergey Bushmanov
  • Thank you. Your code works, but when I tried to change the sentences that you made up and refer to my df["blurb"] column, the rest of the code doesn't seem to like it. Also, with the output, is there a way to get only the words corresponding to each row? For example, removing the 0.0 words, because they are not relevant to that row. – U108456 Oct 23 '20 at 10:01
  • Show a [reprex] please including input as a text – Sergey Bushmanov Oct 23 '20 at 10:02
  • this is when I changed the df from your sentence example to my column: df = pd.DataFrame(df1["blurb"]) vectorizer = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w+\\b') df["tf_idf"] = list(vectorizer.fit_transform(df["blurb"]).toarray()) ValueError: np.nan is an invalid document, expected byte or unicode string – U108456 Oct 23 '20 at 10:37
  • I showed you a working example. There is no way to help you without seeing your data – Sergey Bushmanov Oct 23 '20 at 10:38
  • It does not let me show the data. The data is simple: it is an Excel sheet with two columns and 7 rows, one column for book names and the other for book blurbs. Each row refers to a different book. – U108456 Oct 23 '20 at 10:46
  • this has worked thanks! I will need to remove the words with scores of zero from the output. For example, 'about' is the only word with a score for that book; the other words with zero relate to the rest of the books. Is there a way? {'100': 0.0, '125': 0.0, '260': 0.0, '30': 0.0, '50': 0.0, 'about': 0.10926942718221799, 'acceptance': 0.0, 'achieve': 0.0, 'across': 0.0, 'actually': 0.0, 'affecting': 0.0, 'after': 0.0, 'air': 0... – U108456 Oct 23 '20 at 11:20
  • I have already done this with `{k:v for k,v in dict(zip(vocab,x)).items() if v!=0}`. If you still see `0` they are not zeroes, but values close to 0 – Sergey Bushmanov Oct 23 '20 at 11:21
  • Sorry I missed it. It works, thanks! Should I be pre-processing my data to remove stopwords before applying the tf-idf function? I just looked at the scores, and some scores for 'and/the/is/this/etc' seem quite high, considering tf-idf is supposed to reduce the weighting of high-frequency words. – U108456 Oct 23 '20 at 12:08
  • tfidf per se does not have an opinion of what is useful or not. It's a preprocessing step among the many possible. A model you train on top of your data preprocessing pipeline does have an opinion. – Sergey Bushmanov Oct 23 '20 at 12:10
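On the stop-word question raised above: TfidfVectorizer does accept a stop_words parameter, including a built-in 'english' list, so common words like "the" and "is" can be excluded before weighting. A minimal sketch, using made-up blurbs for illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical blurbs mixing stop words and content words
df = pd.DataFrame({"blurb": ["the book is about space travel",
                             "another book about the sea"]})

# stop_words="english" applies scikit-learn's built-in English stop-word list,
# so words like "the", "is", "about" never enter the vocabulary
vectorizer = TfidfVectorizer(stop_words="english")
df["tf_idf"] = list(vectorizer.fit_transform(df["blurb"]).toarray())

print(sorted(vectorizer.vocabulary_))  # only content words remain
```

Whether dropping stop words helps depends on the downstream model, per the comment above; the parameter just makes the choice explicit.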