
I was trying to render an HTML page on the fly using BeautifulSoup version 4 in Django (on Apache2 with mod_python). However, as soon as I pass any HTML string to the BeautifulSoup constructor (see code below), the browser just hangs waiting for the web server. The equivalent code works like a charm from the CLI, so I'm guessing the problem is related to the environment BeautifulSoup runs in, in this case Django + Apache + mod_python.

import bs4
import django.shortcuts as shortcuts

def test(request):
    s = bs4.BeautifulSoup('<b>asdf</b>')
    return shortcuts.render_to_response('test.html', {})

I installed BeautifulSoup 4 with pip (pip install beautifulsoup4). I also tried BeautifulSoup 3 from the standard Debian package (apt-get install python-beautifulsoup), and with it the following equivalent code works fine, both from the browser and from the CLI.

from BeautifulSoup import BeautifulSoup
import django.shortcuts as shortcuts

def test(request):
    s = BeautifulSoup('<b>asdf</b>')
    return shortcuts.render_to_response('test.html', {})

I have looked in Apache's access and error logs, and they show nothing about what happens to the request that stalls. I have also checked /var/log/syslog and /var/log/messages, but found no further information there.

Here's the Apache configuration I used:

<VirtualHost *:80>
    DocumentRoot /home/nandersson/src
    <Directory /home/nandersson/src>
        SetHandler python-program
        PythonHandler django.core.handlers.modpython
        SetEnv DJANGO_SETTINGS_MODULE app.settings
        PythonOption django.root /home/nandersson/src
        PythonDebug On
        PythonPath "['/home/nandersson/src'] + sys.path"
    </Directory>

    <Location "/media/">
        SetHandler None
    </Location>
    <Location "/app/poc/">
        SetHandler None
    </Location>
</VirtualHost>

I'm not sure how to debug this further, or whether it's a bug at all. Does anyone have ideas on how to get to the bottom of this, or has anyone run into similar problems?

Niklas9
  • show us the apache config you used. and the wsgi.py file contents as well. – andrean Sep 27 '12 at 10:55
  • I updated with this, thanks. However I'm using mod_python as mentioned so there's no wsgi.py to show. – Niklas9 Sep 27 '12 at 13:57
  • totally unrelated but you should import your packages like this: `from django import shortcuts` since you are using python2.7 – Hassek Sep 28 '12 at 02:30

4 Answers

15

I'm using Apache2 with mod_python. I solved the hang by explicitly passing 'html.parser' as the parser when creating the soup.

s = bs4.BeautifulSoup('<b>asdf</b>', 'html.parser')
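
Some context on why naming the parser can matter: when no parser is given, Beautiful Soup 4 picks the best one it finds installed (lxml first, then html5lib, then the built-in html.parser), so a bare BeautifulSoup(markup) call may end up on lxml without saying so. The sketch below is only an illustration, written for Python 3. The names `installed_parsers` and `make_soup` are hypothetical helpers, not part of bs4.

```python
import importlib.util

# Order in which Beautiful Soup 4 prefers HTML parsers when none is named.
# Each pair is (name passed to BeautifulSoup, module that parser needs).
PARSER_PRIORITY = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),  # standard library, always present
]


def installed_parsers():
    """Return the parser names whose backing module can be found, best first."""
    return [
        name
        for name, module in PARSER_PRIORITY
        if importlib.util.find_spec(module) is not None
    ]


def make_soup(markup, parser="html.parser"):
    """Hypothetical helper: always name the parser instead of using the default."""
    from bs4 import BeautifulSoup  # needs the beautifulsoup4 package

    return BeautifulSoup(markup, parser)
```

Calling `installed_parsers()` both from the CLI and from inside the web process shows whether the two environments would pick the same default; `make_soup('<b>asdf</b>')` then pins the pure-Python parser regardless.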
chx3
  • This works! Thanks. But I would prefer using 'lxml' for performance, however then the same error occurs. Do you know why this happens? Is it a bug? – Niklas9 Oct 11 '12 at 08:33
  • Sorry for late reply. For better performance, I have changed to use lxml. I encounter no problem at all. What's your code? – chx3 Mar 21 '13 at 10:39
  • Ah, I realized that mod_python is out of date; mod_wsgi is the module that is actively maintained. The problem disappeared when I replaced mod_python with mod_wsgi, and it now works with lxml. – Niklas9 Mar 21 '13 at 11:45
2

This may be the interaction between Cython and mod_wsgi described here, and explored in a Beautiful Soup context here. Here are earlier questions similar to yours.

Leonard Richardson
2

Try

doc = BeautifulSoup(html, 'html5lib')

In my case, 'html.parser' often leads to an HTMLParseError: https://groups.google.com/forum/?fromgroups=#!topic/beautifulsoup/x_L9FpDdqkc
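
If one parser fails on some input, a fallback chain is one way to combine the suggestions in this thread. This is a rough sketch for Python 3, not something the library provides: `soup_with_fallback` and its `factory` parameter are hypothetical names, and catching every exception is a deliberate simplification.

```python
def soup_with_fallback(markup, parsers=("html5lib", "html.parser"), factory=None):
    """Try each named parser in turn and return the first soup that builds.

    factory is any callable taking (markup, parser_name); it defaults to
    bs4.BeautifulSoup and is a parameter only so the logic can be exercised
    without the package installed.
    """
    if not parsers:
        raise ValueError("no parsers given")
    if factory is None:
        from bs4 import BeautifulSoup as factory  # needs beautifulsoup4

    last_error = None
    for parser in parsers:
        try:
            return factory(markup, parser)
        except Exception as exc:  # e.g. parser not installed, or a parse error
            last_error = exc
    # Every parser failed: surface the last failure to the caller.
    raise last_error
```

Note that a fallback like this only helps with errors that raise; it would not rescue a request that hangs inside the parser, which is the symptom in the question.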

goodhyun
1

I experienced the same issue about a year ago. I just tried again on a similar setup (Django + mod_wsgi + Apache2) with a newer version, BeautifulSoup 4.3.2, and the problem seems to have been fixed.
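
To see whether a deployment is at least on that version, something like the following could be run inside the web process. It is a minimal sketch for Python 3: `version_tuple` and `at_least` are hypothetical helpers, and 4.3.2 is simply the version reported above, not a confirmed fix boundary.

```python
def version_tuple(text):
    """Turn '4.3.2' into (4, 3, 2); anything after the leading digits is ignored."""
    numbers = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def at_least(installed, minimum="4.3.2"):
    """True if the installed version string is not older than the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


# Usage inside the application (needs beautifulsoup4):
#     import bs4
#     print(bs4.__version__, at_least(bs4.__version__))
```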

lehins