I want to use re.findall() to search a webpage for a certain pattern:

from urllib.request import Request, urlopen
import re


url = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/20.0.1'})
webpage = urlopen(url).read()

findrows = re.compile('<td class="cmeTableCenter">(.*)</td>')
row_array = re.findall(findrows, webpage) #ERROR HERE

I get an error:

TypeError: can't use a string pattern on a bytes-like object
Josh

3 Answers

urlopen(url).read() returns a bytes object, not a (Unicode) string, so you need to decode it before matching a str pattern against it. For example, if you know the page is encoded as UTF-8:

webpage = urlopen(url).read().decode('utf8')

Higher-level HTTP libraries often decode the response for you, but determining the right encoding isn't always trivial, or even possible, so Python's standard library leaves it to you.
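If the server declares a charset in its Content-Type header, you can use that instead of hard-coding UTF-8. Here is a minimal sketch of that idea; decode_body is a hypothetical helper name, the UTF-8 fallback is an assumption, and the sample bytes stand in for a real response so the snippet runs offline:

```python
import re
from email.message import Message


def decode_body(body: bytes, content_type: str = "", default: str = "utf-8") -> str:
    """Decode body with the charset named in a Content-Type value, if any.

    Hypothetical helper: falls back to `default` (an assumption) when the
    header names no charset or names one Python does not know.
    """
    msg = Message()
    msg["Content-Type"] = content_type or "application/octet-stream"
    charset = msg.get_content_charset() or default
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # Unknown codec name in the header: use the fallback instead.
        return body.decode(default, errors="replace")


# Stand-in for urlopen(url).read() and the response's Content-Type header.
sample = '<td class="cmeTableCenter">Café</td>'.encode("latin-1")
page = decode_body(sample, "text/html; charset=ISO-8859-1")

# page is now a str, so a str pattern works.
rows = re.findall(r'<td class="cmeTableCenter">(.*?)</td>', page)
```

With a real response object you would pass its body and its Content-Type header value, or read the charset directly with response.headers.get_content_charset().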

Another option is to use a bytes regex instead:

findrows = re.compile(b'<td class="cmeTableCenter">(.*)</td>')

This is useful if you don't know the encoding and don't mind working with bytes objects throughout your code; the matches it returns are bytes as well.
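As a rough illustration of the bytes approach, here is a sketch run against a made-up fragment (the cell contents are invented, not taken from the actual page):

```python
import re

# Invented sample standing in for the bytes returned by urlopen(url).read().
page = (b'<tr><td class="cmeTableCenter">ES</td>'
        b'<td class="cmeTableCenter">NQ</td></tr>')

# A bytes pattern can be matched against bytes data without decoding.
findrows = re.compile(rb'<td class="cmeTableCenter">(.*?)</td>')
rows = findrows.findall(page)  # each match is a bytes object

# Decode individual matches only where text is needed (ASCII assumed here).
names = [row.decode("ascii") for row in rows]
```

This sketch uses a non-greedy (.*?) rather than the (.*) from the question, because a greedy group would run to the last </td> when several cells share one line.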

Cairnarvon
  • If I wanted to copy information from the provided website in between tags [rows of info.] , would this work? findrows = re.compile(' – Josh May 18 '13 at 20:52
You need to decode the bytes object first:

data = urlopen(url).read()
webpage = data.decode('utf-8')  #converts `bytes` to `str`
findrows.findall(webpage)
Ashwini Chaudhary

Alternatively, you can compile a bytes regexp:

re.compile(b"yourpatternhere")
Regexident
Max