How do I fix the error "cannot use a string pattern on a bytes-like object"?

Question

I am trying to read a PDF file and convert it to text by following this tutorial, but I keep getting an error. Here is my Python code:

import PyPDF2 
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#Discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()
 
if text != "":
   text = text
 
else:
   text = textract.process(fileurl, method='tesseract', language='eng')
 
 
tokens = word_tokenize(text)
 
punctuations = ['(',')',';',':','[',']',',']
 
stop_words = stopwords.words('english')
 
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]

The error I keep getting is:

tokens = word_tokenize(text)

TypeError: cannot use a string pattern on a bytes-like object

How can I fix the error?

Possible duplicate of [TypeError: can't use a string pattern on a bytes-like object in re.findall()](https://stackoverflow.com/questions/31019854/typeerror-cant-use-a-string-pattern-on-a-bytes-like-object-in-re-findall) — MyNameIsCaleb, Sep 25 '19 at 02:34
Check the duplicate. `word_tokenize` uses `regex` on the backend so this solution will work for you as well. — MyNameIsCaleb, Sep 25 '19 at 02:34
@MyNameIsCaleb I reviewed the answer you referenced, but I don't know how to apply it to my situation — e.iluf, Sep 25 '19 at 02:35
Try checking `type(text)`. If it shows **bytes**, you need to decode it to a string. — Md Jewele Islam, Sep 25 '19 at 02:37
Great, I added it as an answer to be more clear for future people that find this question. — MyNameIsCaleb, Sep 25 '19 at 02:43

score 3 · Accepted Answer · answered Sep 25 '19 at 02:42


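Your `text` is a bytes object, but `word_tokenize` needs a string because it uses regex in the backend. In your code this most likely happens when the fallback branch runs, since `textract.process` returns bytes rather than a string.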

Change the `word_tokenize` line so that it decodes the bytes first:

tokens = word_tokenize(text.decode("utf-8"))
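Note that calling `.decode` on a value that is already a `str` raises an `AttributeError`, so it may be safer to decode only when needed. Here is a minimal sketch of that idea. `ensure_text` is a hypothetical helper name (not part of NLTK, PyPDF2, or textract), the standard `re` module stands in for `word_tokenize` so the snippet runs without extra packages, and UTF-8 is assumed as the encoding:

```python
import re


def ensure_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes-like.

    Hypothetical helper for illustration; UTF-8 is an assumption.
    """
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value


# A str pattern applied to bytes reproduces the error from the question.
pattern = re.compile(r"\w+")
raw = b"Extracted PDF text"  # stand-in for bytes returned by an extractor

try:
    pattern.findall(raw)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# Decoding first lets the same pattern work.
words = pattern.findall(ensure_text(raw))

# A value that is already a str passes through unchanged.
already_text = ensure_text("plain string")
```

With a helper like this, the tokenizing line in the question could become `tokens = word_tokenize(ensure_text(text))`, which should work whichever branch produced `text`.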


MyNameIsCaleb

