I have a MySQL table full of crawled news article HTML. I would like to extract the article text with the newspaper3k module, which I have done many times before.

The only difference now is that I am not fetching a URL and parsing the result with Newspaper; instead I pull raw HTML strings from a MySQL database.

Somehow Newspaper (or Goose) doesn't like the string from the DB: the returned article.text is always ''.

However, when I fetch a URL with requests.get and feed the raw HTML to Newspaper, it works. So I'm guessing that the data from MySQL is formatted or encoded differently, so that Newspaper does not recognize it as HTML?

When I print the data from the DB, it looks like:

<!DOCTYPE html>\n<html lang="de">\n<head>\n\n<...

While the HTML via requests.get looks like:

<!DOCTYPE html>
<html lang="de">
<head>

<meta charset="utf-8">
<!-- 
    This website is powered by TYPO3 - inspiring people to share!
    TYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL.
    TYPO3 is copyright 1998-2016 of Kasper Skaarhoj. Extensions are copyright of their respective owners.
    Information and contribution at http://typo3.org/
--> ...
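
One way to check that guess is to inspect the value itself rather than its printed form. The following is only a minimal sketch: `describe_html_value` is a hypothetical helper name, and it assumes the stored HTML is UTF-8. It reports whether the value is `bytes` or `str`, and whether it contains real line breaks or the two-character sequence backslash + `n`.

```python
def describe_html_value(value):
    """Summarize how an HTML value fetched from the database is stored.

    Hypothetical diagnostic helper; assumes UTF-8 if the value is bytes.
    """
    if isinstance(value, (bytes, bytearray)):
        kind = "bytes"
        text = bytes(value).decode("utf-8", errors="replace")
    else:
        kind = "str"
        text = str(value)
    return {
        "kind": kind,
        # Actual line-break characters in the text.
        "real_newlines": text.count("\n"),
        # Literal backslash followed by the letter n.
        "escaped_newlines": text.count("\\n"),
    }
```

Note that printing a whole result row (a tuple) or calling `repr()` on a string also displays line breaks as `\n`, so the printed form alone does not prove the stored data is different.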
hag o hi
  • Do you want a Python solution or a MySQL (SQL) solution? –  Sep 04 '18 at 10:29
  • Obviously the article extraction should be handled in Python. However, reformatting the string could happen in MySQL. I don't really care. – hag o hi Sep 04 '18 at 10:31
  • I think you should file an issue on https://github.com/codelucas/newspaper/issues – David Sep 15 '18 at 12:32
  • Perhaps check if your issue could be related to this: https://github.com/codelucas/newspaper/issues/605 – David Sep 15 '18 at 12:34

2 Answers

What you are getting is the header of a TYPO3 page, possibly the default 404 page. (Look at the complete HTML to check.)

If your request should be served by anything other than TYPO3, the (.htaccess) configuration for that is missing: by default, TYPO3 answers every request as long as there is no static file at the requested URL path.

Or do you expect the TYPO3 server to answer with something other than a complete page (AJAX: an HTML snippet or JSON)?
Then you probably do not have the correct configuration in TYPO3 to omit the headers.

As TYPO3 is involved, you might also tag your question with TYPO3.

Bernd Wilke πφ
  • Thanks for your answer, but this does not help at all. – hag o hi Sep 11 '18 at 10:37
  • TYPO3 does not store the complete HTML markup of a page in the database, so you would never get plain HTML like your first example as the result of a DB query. Beyond that, you give no further information that would help identify the problem or allow further hints, especially what TYPO3 has to do with your output. Do not expect more hints to solve your problem. – Bernd Wilke πφ Sep 11 '18 at 12:13

I solved it myself. Thanks everyone.

It turned out I just needed to run BeautifulSoup on the HTML from the database to parse it as soup. Now it works.
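
The exact code is not shown here, so the following is only a sketch of what this might look like. The function names, the choice of the `html.parser` backend, and the default language `"de"` are assumptions for illustration. The HTML is first round-tripped through BeautifulSoup and the re-serialized markup is then handed to Newspaper through `download(input_html=...)` instead of letting it fetch a URL.

```python
from bs4 import BeautifulSoup


def html_via_soup(raw_html):
    """Parse HTML (str or bytes) with BeautifulSoup and serialize it again."""
    # "html.parser" is an assumed choice; lxml or html5lib may behave differently.
    soup = BeautifulSoup(raw_html, "html.parser")
    return str(soup)


def extract_text(raw_html, language="de"):
    """Return the article text Newspaper extracts from HTML stored in a database."""
    # Imported here so html_via_soup() can be used without newspaper3k installed.
    from newspaper import Article

    article = Article(url="", language=language)
    # Supply the markup directly; no network request is made.
    article.download(input_html=html_via_soup(raw_html))
    article.parse()
    return article.text
```

With a row fetched from MySQL, usage would be along the lines of `text = extract_text(row[0])`, where `row[0]` stands for whatever column holds the HTML.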

hag o hi