I want to use the Wiktionary dump for POS tagging, but my script fails while downloading it. Here is my code:

import nltk
from urllib import urlopen
from collections import Counter
import gzip

url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
fStream = gzip.open(urlopen(url).read(), 'rb')
dictFile = fStream.read()
fStream.close()

text = nltk.Text(word.lower() for word in dictFile())
tokens = nltk.word_tokenize(text)

Here is the error I get:

Traceback (most recent call last):
  File "~/dir1/dir1/wikt.py", line 15, in <module>
    fStream = gzip.open(urlopen(url).read(), 'rb')
  File "/usr/lib/python2.7/gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "/usr/lib/python2.7/gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: file() argument 1 must be encoded string without NULL bytes, not str
Process finished with exit code 1
deadly
Alex

1 Answer

You are passing the downloaded data to gzip.open(), which expects to be passed a filename instead.

The code then tries to open a file whose name is the gzipped data itself, and fails because that data contains NULL bytes.

Either save the URL data to a file, then use gzip.open() on that, or decompress the gzipped data using the zlib module instead. 'Saving' the data can be as easy as using a StringIO.StringIO() in-memory file object:

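from StringIO import StringIO
from urllib import urlopen
import gzip

url = 'http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz'
# Download the compressed bytes and wrap them in a seekable in-memory file
inmemory = StringIO(urlopen(url).read())
# Pass the file object via fileobj= instead of passing a filename
fStream = gzip.GzipFile(fileobj=inmemory, mode='rb')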
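
The zlib alternative mentioned above could look roughly like the sketch below. The helper name `gunzip_bytes` is made up for illustration. It assumes the whole download fits in memory and relies on the `16 + zlib.MAX_WBITS` window setting, which tells zlib to expect a gzip header and trailer:

```python
import zlib


def gunzip_bytes(data):
    """Decompress a complete gzip payload held in memory.

    gunzip_bytes is a hypothetical helper name, not part of any library.
    """
    # 16 + MAX_WBITS makes zlib parse the gzip header and trailer
    # instead of expecting a bare zlib stream.
    return zlib.decompress(data, 16 + zlib.MAX_WBITS)


# Usage (not executed here, since it needs network access):
#   dictFile = gunzip_bytes(urlopen(url).read())
```

This skips the file object entirely, at the cost of holding both the compressed and the decompressed data in memory at once.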
Martijn Pieters
  • Ah, StringIO, how I love you. – Slater Victoroff Aug 09 '13 at 13:29
  • you could [pass `urlopen()` object as `fileobj` directly](http://stackoverflow.com/a/26435241/4279) – jfs Jun 11 '15 at 12:12
  • @J.F.Sebastian: `GzipFile` expects to be able to seek in the file, the `urlopen()` file object doesn't support seeking. – Martijn Pieters Jun 11 '15 at 13:36
  • @MartijnPieters: It was a bug that was fixed in Python 3.2 and backported to Python 2 – jfs Jun 11 '15 at 13:38
  • @J.F.Sebastian: you mean [this issue](http://bugs.python.org/issue1675951)? I don't see that backported, but I may be missing something. – Martijn Pieters Jun 11 '15 at 13:47
  • @J.F.Sebastian: that change was never backported as far as I can make out. There are still plenty of seek calls in Python 2.7 `gzip.GzipFile` when reading, and without a `StringIO` wrapper you cannot decompress data loaded from a URL with `urlopen`. – Martijn Pieters Jun 11 '15 at 13:57
  • @MartijnPieters: yes. It fails on Python 2. I've only run it on Python 3. It means http://bugs.python.org/issue914340 was closed by mistake. – jfs Jun 11 '15 at 14:06
  • It demonstrates yet again that you can't believe anyone until you've tried the code yourself. – jfs Jun 11 '15 at 15:05
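
For Python 3, where the comments above report that passing the response object directly works, a minimal sketch might look like this. The helper name `read_gzip_stream` is hypothetical. It assumes Python 3.2 or later, where `gzip.GzipFile` no longer needs to seek in the underlying file when reading:

```python
import gzip


def read_gzip_stream(fileobj):
    """Decompress gzip data from any file-like object that has read().

    read_gzip_stream is a hypothetical helper name. On Python 3.2+ the
    underlying object does not need to support seek().
    """
    with gzip.GzipFile(fileobj=fileobj, mode='rb') as gz:
        return gz.read()


# Usage on Python 3 (not executed here, since it needs network access):
#   from urllib.request import urlopen
#   with urlopen(url) as response:
#       dictFile = read_gzip_stream(response)
```

On Python 2.7 this fails with a `urlopen()` response, as noted in the comments, so the `StringIO` wrapper from the answer is still needed there.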