  • I have HTML files on my local hard drive that I am trying to open in a webpage by sending an HTTP request.
  • Once the HTTP request is created, I am trying to parse the stored HTML file by passing its URL. (Parsing succeeds when passing one file at a time, but I want to do it dynamically for all the files in a directory, so I used a for loop. This doesn't work out.)
  • Once the parsing is done, I save the data to a JSON file. (This works fine.) I have pasted the code here:

    import json
    import os
    from newspaper import Article
    import newspaper
    
    # initiating the server
    server_start = os.system('start "HTTP Server on port 8000" cmd.exe /c {python -m http.server}')
    http_server = 'http://localhost:8000/'
    links = ''
    path = "<path>"
    for f in os.listdir(path):
        if f.endswith('.html'):
            links = http_server + path + f
    
        blog_post = newspaper.build(links)
    
        for article in blog_post.articles:
            print(article.url)
    
        article = Article(links)
        article.download('')
        article.parse()
        data = {"HTML": article.html, "author": article.authors, "title": article.title, "text": article.text, "date": str(article.publish_date)}
    
        json_data = json.dumps(data)
        with open('data.json', 'w') as outfile:
            json.dump(data, outfile)
    

Error message:

    ...\newspaper\Scripts\python.exe ".../parsing_newspaper/test1.py"
    [Source parse ERR] http://localhost:8000/.../cnnpolitics-russian.html
    Traceback (most recent call last):
      File "...\newspaper\lib\site-packages\newspaper\parsers.py", line 68, in fromstring
        cls.doc = lxml.html.fromstring(html)
      File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 876, in fromstring
        doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
      File "...\newspaper\lib\site-packages\lxml\html\__init__.py", line 762, in document_fromstring
        value = etree.fromstring(html, parser, **kw)
      File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:78994)
      File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118325)
      File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:116883)
      File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:110870)
      File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
      File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
      File "src\lxml\parser.pxi", line 646, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105947)
      File "<string>", line 0
    lxml.etree.XMLSyntaxError:
    You must download() an article before calling parse() on it!
    Traceback (most recent call last):
      File ".../test1.py", line 26, in <module>
        article.parse()
      File "...\newspaper\lib\site-packages\newspaper\article.py", line 168, in parse
        raise ArticleException()
    newspaper.article.ArticleException


2 Answers


I don't know if this helps, but try this:

import json
import os
from newspaper import Article
import newspaper

# start a local HTTP server on port 8000 in a separate console window
# (quotes replace the original {braces}, which cmd.exe does not parse as a command)
server_start = os.system('start "HTTP Server on port 8000" cmd.exe /c "python -m http.server"')
http_server = 'http://localhost:8000/'
path = "<path>"

for f in os.listdir(path):
    if f.endswith('.html'):
        links = http_server + path + f

        blog_post = newspaper.build(links)
        for article in blog_post.articles:
            print(article.url)

        article = Article(links)
        # download() without an argument fetches the URL;
        # download('') sets empty HTML and leaves the article undownloaded
        article.download()
        article.parse()
        data = {
            "HTML": article.html,
            "author": article.authors,
            "title": article.title,
            "text": article.text,
            "date": str(article.publish_date),
        }

        # note: this overwrites data.json on every pass; use per-file names to keep all results
        with open('data.json', 'w') as outfile:
            json.dump(data, outfile)

Because otherwise, if the first file does not have an .html extension, you end up calling newspaper.build() on an empty string.

And if the first file does have an .html extension but a later one does not, you build the same file (at least) twice.
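
If it helps, the same fix reads a bit more clearly with a guard clause, so everything in the loop body only ever runs for .html files (a sketch; `path` and `http_server` are as above, and `handle_file` is just a hypothetical name for the build/download/parse/save steps):

    import os

    path = "<path>"
    http_server = 'http://localhost:8000/'

    for f in os.listdir(path):
        if not f.endswith('.html'):
            continue  # skip non-HTML files instead of silently reusing the previous link
        link = http_server + path + f
        handle_file(link)  # hypothetical: the per-file work shown in the code above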

Seppe Mariën

A checklist to follow before going deeper into debugging (a sketch implementing these checks follows the list):

  1. Check that the HTML is not empty
  2. Check that the HTML is "well-formed"
  3. Check that the article is not empty
  4. Check that the article was actually downloaded (that is what parse() verifies, but doing it yourself helps you isolate "problematic" articles)
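
Applied to the code in the question, the checks might look something like this (a minimal sketch, assuming the files are served over http.server as in the question; `check_article` is a hypothetical helper, and the well-formedness probe only catches markup that lxml cannot build a tree from at all):

    from lxml import etree
    import lxml.html
    from newspaper import Article

    def check_article(url):
        """Hypothetical helper: run the checklist above, return a parsed Article or None."""
        article = Article(url)
        article.download()  # no input_html argument, so newspaper fetches the URL itself
        # 1. and 4.: an empty article.html means the download never succeeded
        if not article.html:
            print('empty or undownloaded HTML:', url)
            return None
        # 2.: lxml raises if it cannot build any tree from the markup
        try:
            lxml.html.fromstring(article.html)
        except etree.ParserError:
            print('malformed HTML:', url)
            return None
        article.parse()
        # 3.: parsing can succeed yet find no recognizable article body
        if not article.text:
            print('no article text extracted:', url)
        return article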
sdikby
  • 1. The HTML is not empty, for sure. – Abdul Fattah Mohammed Mar 10 '17 at 16:08
  • Thank you for your response @sdikby. 1. The HTML is not empty, for sure. 2. I saved the webpages (HTML) directly from the website. 3. Later in the task, the HTML files will be scraped from the web using scrapy and the webpages will be stored on the local disk. I'm not sure whether the articles will be downloaded, and I also did not quite understand what you mean by "problematic" articles. – Abdul Fattah Mohammed Mar 10 '17 at 16:15
  • Sorry, I didn't find a better term to describe what I mean. By "problematic" I meant the articles that could not be downloaded for some reason (exceptions) that `newspaper` defines. There is the class attribute self._downloaded (I think) that you can test against. Or maybe I am missing something and what I suggested doesn't apply at all. – sdikby Mar 10 '17 at 18:14
  • In short, you can't parse an article that could not be downloaded. What you can also do is surround article.download() and article.parse() with `try..except`. – sdikby Mar 10 '17 at 18:17
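
For reference, the `try..except` pattern suggested in the last comment might look like this (a minimal sketch; `urls` stands for the links built in the question's loop):

    from newspaper import Article
    from newspaper.article import ArticleException

    failed = []                        # collect the "problematic" URLs for later inspection
    for url in urls:                   # urls: the links built in the question's loop
        article = Article(url)
        try:
            article.download()
            article.parse()            # raises ArticleException if the download failed
        except ArticleException as exc:
            failed.append((url, str(exc)))
            continue
        print(article.title)

    print('failed articles:', failed)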