Using: Python 3.4

I am trying to use the Wikipedia scripts/modules from here:

http://pastebin.com/FVDxLWNG (wikipedia.py)

http://pastebin.com/idw8vQQK (wiki2plain.py)

The issue I have is with the following code:

def article(self, article):
    url = self.url_article % (self.lang, urllib.parse.quote_plus(article))
    content = self.__fetch(url).read()

    if content.upper().startswith("#REDIRECT"): 
        match = re.match('(?i)#REDIRECT \[\[([^\[\]]+)\]\]', content)

        if not match == None:
            return self.article(match.group(1))

        raise WikipediaError('Can\'t found redirect article.')

    return content

If I run this, I get the error "startswith first arg must be bytes or a tuple of bytes, not str", so I changed the condition to:

if content.upper().startswith(b"#REDIRECT"): 

That line then runs OK, but later I get "TypeError: can't use a string pattern on a bytes-like object" when I try to use the method. I have already changed parts of the script to work in 3.4, but I can't get this working. How do I resolve this TypeError issue on startswith?

File "C:\Anaconda3\lib\re.py", line 179, in sub
    return _compile(pattern, flags).sub(repl, string, count)

TypeError: can't use a string pattern on a bytes-like object

jaesson1985
  • How do you know your `TypeError` is with `startswith`? Given that it's coming from `re.py` and mentions `_compile` and a "string pattern", maybe it's the next `re.match` line? – Amit Kumar Gupta Apr 22 '15 at 08:38
  • Try making that string in the next line a bytestring. It will probably also need to be a raw string (`r"foo"`). – L3viathan Apr 22 '15 at 08:39
  • Probably you should also change regexp: `match = re.match(b'(?i)#REDIRECT \[\[([^\[\]]+)\]\]', content)` – Esdes Apr 22 '15 at 09:03
  • @AmitKumarGupta : You are correct. The re.match is the problem. I ended up changing the code by converting the data variable mined from wiki with `str(variablename)` and THEN only going on with the code like before. I checked this and it was really a gigantic bytestring once data is retrieved. Converting to string did the trick. Thanks for the help ! – jaesson1985 Apr 22 '15 at 09:51
  • The `b` fixed the TypeError for the `if` condition, but then the `re.match` line was reached, causing another TypeError for the same reason. – Karl Knechtel Sep 17 '22 at 00:55
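
The two approaches from the comments can be sketched as follows. This is a minimal illustration, not code from the linked scripts: `find_redirect_bytes`, `find_redirect_text` and the sample payload `raw` are hypothetical names, and UTF-8 is an assumed encoding. Note that calling `str()` on a bytes object without an encoding produces its repr (text starting with `b'`), so `.decode()` is usually the safer conversion.

```python
import re

# Hypothetical sample standing in for the bytes returned by .read().
raw = b"#REDIRECT [[Python (programming language)]]"

# Option 1: keep the data as bytes and use a bytes pattern (note the rb prefix).
BYTES_PATTERN = re.compile(rb"(?i)#REDIRECT \[\[([^\[\]]+)\]\]")

def find_redirect_bytes(content):
    """Return the redirect target as bytes, or None if there is no redirect."""
    match = BYTES_PATTERN.match(content)
    return match.group(1) if match is not None else None

# Option 2: decode to str first, then use an ordinary (raw) string pattern.
TEXT_PATTERN = re.compile(r"(?i)#REDIRECT \[\[([^\[\]]+)\]\]")

def find_redirect_text(content, encoding="utf-8"):
    """Decode the bytes (UTF-8 is an assumption) and return the target as str."""
    match = TEXT_PATTERN.match(content.decode(encoding))
    return match.group(1) if match is not None else None

# Mixing the two kinds is what raises the TypeError from the question.
try:
    TEXT_PATTERN.match(raw)
except TypeError as exc:
    print("mixed str pattern and bytes data:", exc)

print(find_redirect_bytes(raw))
print(find_redirect_text(raw))
```

Either option works as long as the pattern and the data are the same type; decoding once, right after the read, keeps the rest of the code working with `str`.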

1 Answer

You should decode `content` using the proper encoding.

Instead of `content = self.__fetch(url).read()`, try this:

result = self.__fetch(url)
content = result.read().decode(result.headers.get_content_charset())
Aleksandr K.
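
One caveat with the snippet above: `get_content_charset()` returns `None` when the Content-Type header carries no charset parameter, and `bytes.decode(None)` then raises a TypeError. Below is a minimal sketch with a fallback, assuming UTF-8 is an acceptable default. `decode_body` is a hypothetical helper, and the headers are built by hand with `email.message.Message` (the base class of the header object that `urllib.request.urlopen` responses expose) so the example runs without network access. Whether `__fetch` actually returns such a response is also an assumption.

```python
from email.message import Message

def decode_body(body, headers, default="utf-8"):
    """Decode response bytes using the declared charset, else a default.

    `default` is an assumption; pick whatever suits the site being fetched.
    """
    charset = headers.get_content_charset(failobj=default)
    return body.decode(charset)

# Headers that declare a charset.
declared = Message()
declared["Content-Type"] = "text/html; charset=iso-8859-1"
print(decode_body("café".encode("latin-1"), declared))

# Headers without a charset: the fallback is used instead of None.
undeclared = Message()
undeclared["Content-Type"] = "text/html"
print(decode_body("café".encode("utf-8"), undeclared))
```

In the method from the question, this would correspond to something like `content = decode_body(result.read(), result.headers)`.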