I'm aware that `unicode` was replaced by `str` in Python 3, but I keep getting the same error no matter how I write this code. Can anyone tell me why?

I'm using boilerpipe for a specific set of webcrawls:

for urls in allUrls:
    fileW = open('article('+ str(counter)+')', 'w')
    articleDate = Article(urls)
    articleDate.download()
    articleDate.parse()
    print(articleDate.publish_date)
    fileW.write(str(Extractor(extractor='ArticleExtractor', url=urls).getText() + "\n\n\n" + str(articleDate.publish_date)+"\n\n\n"))
    fileW.close
    counter +=1

error:

 Traceback (most recent call last):
  File "/Users/Adrian/anaconda3/lib/python3.6/site-packages/boilerpipe/extract/__init__.py", line 45, in __init__
    self.data = unicode(self.data, encoding)
NameError: name 'unicode' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "webcrawl.py", line 26, in <module>
    fileW.write(str(Extractor(extractor='ArticleExtractor', url=urls).getText() + "\n\n\n" + str(articleDate.publish_date)+"\n\n\n"))
  File "/Users/Adrian/anaconda3/lib/python3.6/site-packages/boilerpipe/extract/__init__.py", line 47, in __init__
    self.data = self.data.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Adrian Coutsoftides
  • Unrelated to your problem: `filew.close` doesn't do anything. Try `fileW.close()`. – Robᵩ Jan 12 '18 at 22:07
  • Please copy-paste the entire error message, including any traceback, into your question. – Robᵩ Jan 12 '18 at 22:07
  • @Robᵩ I haven't used a lower case 'w' in any of the variable declarations? I'll add the full traceback now. – Adrian Coutsoftides Jan 12 '18 at 22:14
  • Rob is referring to the second to last line, `fileW.close`. You need to add parentheses to actually call the method. – Collin R Jan 12 '18 at 22:18
  • Better yet, use the `with open(..` syntax as it closes automatically when leaving its scope. – Jongware Jan 12 '18 at 22:43
  • @Robᵩ ah yes, thank you very much! usr256 and Collin R too. – Adrian Coutsoftides Jan 12 '18 at 23:41
  • @AdrianCoutsoftides you're welcome! If my answer below helped you and if you don't mind selecting it as the "correct" answer, that will let viewers know the question is closed (and full disclosure: it gives me some extra reputation points). :) – Collin R Jan 13 '18 at 01:20
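
A minimal sketch of the `with open(...)` suggestion from the comments above. `save_article` is a hypothetical helper name, and the text and date passed to it are placeholders standing in for the crawler's output; nothing here calls boilerpipe itself.

```python
import os
import tempfile


def save_article(path, text, publish_date):
    """Write the article text and its publish date to path."""
    # The with-block closes the file on exit, even if write() raises,
    # so no explicit close() call is needed.
    with open(path, "w", encoding="utf-8") as file_w:
        file_w.write(text + "\n\n\n" + str(publish_date) + "\n\n\n")
    return file_w.closed


# Placeholder values (assumptions) in place of the extracted article.
demo_path = os.path.join(tempfile.mkdtemp(), "article(0)")
was_closed = save_article(demo_path, "Example body", "2018-01-12")
```

Inside the original loop, the same pattern would wrap the `fileW.write(...)` call, which removes the need for the `fileW.close` line altogether.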

1 Answer

The error message points to a line in `boilerpipe/extract/__init__.py` that calls the `unicode` built-in function, which does not exist in Python 3.

I assume the link below is the source code for the package you are using. If so, it appears to be written for Python 2.7, which you can see if you look near the end of this file:

https://github.com/misja/python-boilerpipe/blob/master/setup.py

You have several options as far as I can see:

  1. Find a Python 3 port of this package. There are at least a few out there (here's one and here's another).
  2. Port the package to Python 3 yourself (if that is the only error, you could simply change that line to use str, but later changes could cause problems with other parts of the package). This official tool should be of assistance; this official guide should, as well.
  3. Port your project to Python 2.7 and continue using the same package.
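
As a rough illustration of option 2, the failing `unicode(self.data, encoding)` call could be replaced by something that works on Python 3. This is only a sketch: `to_text` is a hypothetical helper, not part of boilerpipe, and it assumes the data is either `bytes` or already a `str`.

```python
def to_text(data, encoding="utf-8"):
    """Return data as str, decoding only when it is bytes-like."""
    # In Python 3 there is no unicode built-in; str is the text type
    # and bytes must be decoded explicitly.
    if isinstance(data, (bytes, bytearray)):
        return bytes(data).decode(encoding)
    # Already text, so there is nothing to decode.
    return data
```

For example, `to_text(b"caf\xc3\xa9")` and `to_text("café")` both give the same `str`. Other Python 2 idioms elsewhere in the package may still need attention.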

I hope this helps!

Collin R
  • I believe my answer is complete, but if you're curious as to the second error in your output, it happens because `self.data = self.data.decode(encoding)` is in the `except` clause which executes when the call to `unicode` causes the exception. Since `unicode` calls will fail in Python 3, the script assumes that `self.data` is a byte array and tries to call its `decode` method. However, if `self.data` is a `str` then it has no `decode` method, so that method call will also fail. Seen at line 47 here: https://github.com/misja/python-boilerpipe/blob/master/src/boilerpipe/extract/__init__.py – Collin R Jan 12 '18 at 22:42
  • Perfect explanation, and this solved the first issue. I just wanted to add that the second error could be resolved by stripping the encoded bytearrays: – Adrian Coutsoftides Jan 13 '18 at 15:00
  • @AdrianCoutsoftides I’m really glad, thanks for the feedback! And I believe you’re right about the 2nd problem; that looks like it would work. – Collin R Jan 13 '18 at 18:40
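
One further observation on the second error: the byte `0x8b` at position 1 matches the second byte of the gzip magic number (`0x1f 0x8b`), so a possible cause is that the server returned a gzip-compressed body that was decoded without being decompressed first. The thread does not confirm this, so treat the sketch below as an assumption. `decode_body` is a hypothetical helper, not a boilerpipe function.

```python
import gzip

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw, encoding="utf-8"):
    """Decode a response body, decompressing it first if it looks gzipped."""
    if raw[:2] == GZIP_MAGIC:
        # Assumption: the payload is a complete gzip stream.
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Simulated compressed response body (placeholder data).
compressed = gzip.compress("héllo".encode("utf-8"))
text = decode_body(compressed)
```

Calling `compressed.decode("utf-8")` directly raises the same kind of `UnicodeDecodeError` as in the question, since the compressed bytes are not valid UTF-8.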