I get the following error when running this code in a Kaggle notebook:

TypeError: cannot use a string pattern on a bytes-like object.

The same code runs without errors in the Spyder IDE.

import nltk 
import pandas as pd
import re

messages = pd.read_csv('../input/spam.csv', sep='\t',
                           names=["label", "message"],encoding='latin-1')

print(messages)

The output of print(messages) is shown at the end of the question.

#text preprocessing

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
corpus =[]

for i in range(0,len(messages)):
    words = re.sub('[^a-zA-Z]','',messages['message'][i])
    words = words.lower()
    words = words.split()
    words = [lemmatizer.lemmatize(word) for word in words if not word in set(stopwords.words('english'))]
    words  = ''.join(words)
    corpus.append(words)

Error details:

TypeError       Traceback (most recent call last)
<ipython-input-8-715dc7ef0530> in <module>
     27 
     28 for i in range(0,len(messages)):
---> 29     words = re.sub('[^a-zA-Z]','',messages['message'][i])
     30     words = words.lower()
     31     words = words.split()

/opt/conda/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192 
    193 def subn(pattern, repl, string, count=0, flags=0):

TypeError: cannot use a string pattern on a bytes-like object

Output of print(messages):

                                                   label  message
0                                              v1,v2,,,      NaN
1     ham,"Go until jurong point, crazy.. Available ...      NaN
2                  ham,Ok lar... Joking wif u oni...,,,      NaN
3     spam,Free entry in 2 a wkly comp to win FA Cup...      NaN
4     ham,U dun say so early hor... U c already then...      NaN
...                                                 ...      ...
5570  spam,"This is the 2nd time we have tried 2 con...      NaN
5571       ham,Will Ì_ b going to esplanade fr home?,,,      NaN
5572  ham,"Pity, * was in mood for that. So...any ot...      NaN
5573  ham,The guy did some bitching but I acted like...      NaN
5574                  ham,Rofl. Its true to its name,,,      NaN

[5575 rows x 2 columns]
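
The printout above suggests the file is comma-separated rather than tab-separated: every whole line lands in `label`, and `message` is NaN in each row, so `messages['message'][i]` is a float and not a string. That may be why `re.sub` rejects it. Below is a minimal sketch of the difference, assuming the file really has the `v1,v2,,,` header seen in row 0. The inline `sample_text` is a hypothetical stand-in for `../input/spam.csv`; for the real file, `encoding='latin-1'` would still be passed as in the question.

```python
import io

import pandas as pd

# Hypothetical three-message sample mimicking the layout in the printout:
# a "v1,v2,,," header line, comma-separated fields, trailing empty columns.
sample_text = (
    'v1,v2,,,\n'
    'ham,"Go until jurong point, crazy..",,,\n'
    'ham,Ok lar... Joking wif u oni...,,,\n'
    'spam,Free entry in 2 a wkly comp,,,\n'
)

# Reading with sep='\t' reproduces the symptom: no tabs are found, so each
# line becomes one field and the second column is filled with NaN.
wrong = pd.read_csv(io.StringIO(sample_text), sep='\t',
                    names=["label", "message"])
print(wrong["message"].isna().all())

# Reading with the default comma separator, then keeping the first two
# columns and renaming them, gives real strings in "message".
right = pd.read_csv(io.StringIO(sample_text), sep=',')
right = right.iloc[:, :2].copy()
right.columns = ["label", "message"]
print(right)
```

If Spyder was reading a different, tab-separated copy of the dataset, that would explain why the same code ran there, but this is only a guess from the printout.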
  • Does this answer your question? [TypeError: can't use a string pattern on a bytes-like object in re.findall()](https://stackoverflow.com/questions/31019854/typeerror-cant-use-a-string-pattern-on-a-bytes-like-object-in-re-findall) – SuperStormer Apr 07 '20 at 15:02
  • Hi, I already tried the step above. Since I am using a DataFrame, I cannot use the decode function. Moreover, the error is at the step ---> 29 words = re.sub('[^a-zA-Z]','',messages['message'][i]). Is there anything I could add here to avoid this? I am able to run the same code in the Spyder IDE without issues. – Ankur Kumar Apr 07 '20 at 15:07
  • Could you print `messages` before the loop and add the output to the question? – Sumit Badsara Apr 07 '20 at 15:09
  • What type is `messages['message'][i]`? (Hint: not a string) – SuperStormer Apr 07 '20 at 15:11
  • @SumitBadsara, I have added an image of the output, https://i.stack.imgur.com/Yxhtt.png . Couldn't paste the print(message) output in a proper format. – Ankur Kumar Apr 07 '20 at 15:34
  • @SuperStormer : – Ankur Kumar Apr 07 '20 at 15:41
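
Following up on the type hint in the comments, here is a minimal sketch of a cleaning step that checks the value's type before calling `re.sub`. `clean_message` is a hypothetical helper, not part of the question's code. It leaves out the lemmatization and stop-word filtering so that it runs without NLTK data. It also substitutes a space instead of `''` and joins with `' '`, because the original pattern and join would fuse all words of a message into one token.

```python
import re


def clean_message(value):
    """Return lowercase, letters-only text; return '' for non-string input."""
    # A mis-parsed column yields NaN (a float), which re.sub cannot process.
    if not isinstance(value, str):
        return ''
    # Replace every non-letter with a space so neighbouring words stay apart.
    text = re.sub('[^a-zA-Z]', ' ', value)
    # split() drops the repeated whitespace; join rebuilds a clean sentence.
    return ' '.join(text.lower().split())


print(type(float('nan')))
print(clean_message(float('nan')))
print(clean_message('Ok lar... Joking wif u oni...'))
```

Printing `type(messages['message'][i])` inside the loop in the same way shows quickly whether the column holds strings at all.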
