re.findall in Python 3

Question

I want to use re.findall() to search the contents of a webpage for a certain pattern:

from urllib.request import Request, urlopen
import re


url = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/20.0.1'})
webpage = urlopen(url).read()

findrows = re.compile('<td class="cmeTableCenter">(.*)</td>')
row_array = re.findall(findrows, webpage) #ERROR HERE

I get an error:

TypeError: can't use a string pattern on a bytes-like object

The `webpage` variable is NOT `str`, it has the type `bytes` so you need to decode it first `webpage.decode('utf-8')` — treecoder, May 18 '13 at 20:36
I would use BeautifulSoup for this problem. Regex and HTML don't mix well — Jason Sperske, May 18 '13 at 20:39
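As a rough illustration of the parser-based approach suggested in the comment above, here is a minimal sketch. It uses the standard library's html.parser rather than BeautifulSoup (a deliberate substitution, so that nothing needs to be installed). The class name CellCollector and the sample markup are made up for this example; only the cmeTableCenter class comes from the question.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td class="cmeTableCenter"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "td" and ("class", "cmeTableCenter") in attrs:
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Append text only while inside a matching cell
        if self._in_cell:
            self.cells[-1] += data


# Made-up sample standing in for the decoded page (a str, not bytes)
sample_html = (
    '<table><tr><td class="cmeTableCenter">CL</td>'
    '<td class="other">x</td></tr></table>'
)
collector = CellCollector()
collector.feed(sample_html)
collector.close()
cells = collector.cells
```

Note that the parser also expects str input, so the bytes returned by read() still have to be decoded first.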

score 6 · Accepted Answer · answered May 18 '13 at 20:39

urlopen(url).read() returns a bytes object, not a (Unicode) string, so you have to decode it before matching it against a string pattern. For example, if you know your page is in UTF-8:

webpage = urlopen(url).read().decode('utf8')

Better HTTP libraries will automatically do this for you, but determining the right encoding isn't always trivial or even possible, so Python's standard library doesn't.
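If the server declares a charset in its Content-Type header, one possible approach is to try that first and fall back to a default. The sketch below is only an illustration: decode_body and its fallback parameter are hypothetical names, and the declared charset may be missing or wrong in practice.

```python
from typing import Optional


def decode_body(raw: bytes, declared_charset: Optional[str], fallback: str = "utf-8") -> str:
    """Decode raw bytes with the declared charset, else with the fallback (hypothetical helper)."""
    encoding = declared_charset or fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: fall back and
        # replace anything that still cannot be decoded
        return raw.decode(fallback, errors="replace")


# Possible use with the code from the question (not executed here):
# with urlopen(url) as response:
#     charset = response.headers.get_content_charset()  # None if not declared
#     webpage = decode_body(response.read(), charset)
```

The errors="replace" fallback trades accuracy for robustness, which is usually acceptable when the goal is only to run a regex over the markup.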

Another option is to use a bytes regex instead:

findrows = re.compile(b'<td class="cmeTableCenter">(.*)</td>')

This is useful if you don't know the encoding and don't mind working with bytes objects throughout your code. Note that the matches returned by a bytes pattern are bytes objects as well.
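A small sketch of the bytes-pattern route, run against made-up sample bytes instead of the live page. It uses a non-greedy (.*?) group, which differs from the pattern in the question and is chosen here so that each match stops at the nearest closing tag.

```python
import re

# Made-up sample standing in for urlopen(url).read()
sample = (
    b'<tr><td class="cmeTableCenter">CL</td></tr>\n'
    b'<tr><td class="cmeTableCenter">ES</td></tr>'
)

# Bytes pattern: both the pattern and the searched data are bytes
findrows = re.compile(b'<td class="cmeTableCenter">(.*?)</td>')
rows = findrows.findall(sample)

# Each match is a bytes object; decode individual values when text is needed
names = [row.decode("ascii", errors="replace") for row in rows]
```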

If I wanted to copy information from the provided website in between tags [rows of info.] , would this work? findrows = re.compile(' — Josh, May 18 '13 at 20:52

score 2 · Answer 2 · answered May 18 '13 at 20:36

You need to decode the bytes object first:

data = urlopen(url).read()
webpage = data.decode('utf-8')  #converts `bytes` to `str`
findrows.findall(webpage)

answered May 18 '13 at 20:36 by Ashwini Chaudhary

score 0 · Answer 3 · edited May 18 '13 at 21:49

Alternatively, you can compile a bytes regexp:

re.compile(b"yourpatternhere")

edited May 18 '13 at 21:49 by Regexident

answered May 18 '13 at 21:02 by Max
