2

I have no idea why this piece of code does not work with this particular site. In other cases it works fine.

    url = "http://www.i-apteka.pl/search.php?node=443&counter=all"
    content = requests.get(url).text
    soup = BeautifulSoup(content)

    links = soup.find_all("a", class_="n63009_prod_link")
    print links

In this case it prints "[]", but there are obviously some links on the page. Any ideas? :)

user985541
    I don't see any links with a `n63009_table_out` class. The only thing with that class is a `div`. Did you mean `soup.select('.n63009_table_out a')`? – Pavel Anossov Apr 04 '13 at 21:12
  • Yeah, the only thing with that class is a `div`. So the code works fine—it successfully returns all 0 of the links with that class. – abarnert Apr 04 '13 at 21:14
  • There was a little mistake; it is correct now, but it still returns `[]` – user985541 Apr 04 '13 at 21:15
  • Also, this class looks sooo autogenerated. I feel it might change at any time. – Pavel Anossov Apr 04 '13 at 21:16
  • With `n63009_prod_link` your code works for me, I get 23 links. – Pavel Anossov Apr 04 '13 at 21:17
  • What versions of Python and BS are you using? – abarnert Apr 04 '13 at 21:17
  • requests returns HTML in which I can easily find these classes – user985541 Apr 04 '13 at 21:18
  • Python 2.7.1, but how do I check the BS version? – user985541 Apr 04 '13 at 21:20
  • You can do `help(bs4)` or `help(BeautifulSoup)` (or whatever name you imported it under); there's a VERSION section down near the bottom. Also, which parser are you using? – abarnert Apr 04 '13 at 21:21
  • And still... it is weird: if I change the URL to, for example, www.onet.pl, it is OK. With this particular page even this code: `links = soup.find_all("a")` does not work. Maybe this is some kind of problem with the encoding of this particular page? – user985541 Apr 04 '13 at 21:23
  • Possibly. Most likely the problem is whatever parser you're using. For example, I just tried with Python 2.7.2 and BS 4.1.3 with a variety of different parsers: stdlib html.parser works, html5lib-0.95 works, lxml 3.1.0 on Apple's libxml2 works, but lxml 3.1.0 on libxml2 2.9.0 from Homebrew fails. That's why we need to know what versions of everything you're using. (If you have no idea what I'm talking about with parsers, see [here](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser).) – abarnert Apr 04 '13 at 21:25
  • In that case… you may have found a bug in the standard built-in parser in an old version of BS that's no longer maintained. Can you upgrade to 4.x, or at least 3.2.1? – abarnert Apr 04 '13 at 21:32
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/27591/discussion-between-user985541-and-abarnert) – user985541 Apr 04 '13 at 21:33

2 Answers

1

You've found a bug in whichever parser you're using.

I don't know which parser you're using but I do know this:

Python 2.7.2 (from Apple), BS 4.1.3 (from pip), libxml2 2.9.0 (from Homebrew), lxml 3.1.0 (from pip) gets the exact same error as you. Everything else I try—including the same things as above except libxml2 2.7.8 (from Apple)—works. And lxml is the default (at least as of 4.1.3) that BS will try first if you don't specify anything else. And I've seen other unexpected bugs with libxml2 2.9.0 (most of which have been fixed on trunk, but no 2.9.1 has been released yet).

So, if this is your problem, you may want to downgrade libxml2 to 2.8.0 and/or build it from the top of the tree.
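To diagnose whether this version mismatch applies to you, lxml itself reports both the libxml2 it was compiled against and the one it is actually running with. A quick check, assuming lxml is installed:

```python
# Quick diagnostic: which lxml and libxml2 versions are in play?
# (Assumes lxml is installed; these are documented lxml.etree attributes.)
import lxml.etree

print(lxml.etree.LXML_VERSION)             # lxml's own version tuple
print(lxml.etree.LIBXML_COMPILED_VERSION)  # libxml2 lxml was built against
print(lxml.etree.LIBXML_VERSION)           # libxml2 it is running with now
```

If the last two differ, lxml is dynamically linked against a different libxml2 than it was built for, which is exactly the kind of setup that produced the failure above.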

But if not… it definitely works for me with 2.7.2 with the stdlib html.parser, and in chat you tested the same thing with 2.7.1. While html.parser (especially before 2.7.3) is slow and brittle, it seems to be good enough for you. So, the simplest solution is to do this:

    soup = BeautifulSoup(content, 'html.parser')

… instead of just letting it pick its favorite parser.
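As a self-contained sanity check (using an inline HTML snippet in place of the live page, whose markup and autogenerated class names may change at any time), forcing the stdlib parser looks like this:

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched page content; the real site's class names
# are autogenerated and may differ.
content = """
<div class="n63009_table_out">
  <a class="n63009_prod_link" href="/product/1">Product 1</a>
  <a class="n63009_prod_link" href="/product/2">Product 2</a>
</div>
"""

# Pass the parser name explicitly instead of letting BS pick lxml.
soup = BeautifulSoup(content, 'html.parser')
links = soup.find_all("a", class_="n63009_prod_link")
print(len(links))  # 2
```

The same `find_all` call that returned `[]` under a broken parser finds both links here, because html.parser builds the tree correctly.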

For more info, see [Specifying the parser to use](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use) (and the sections right above and below it).

abarnert
0

I had the same problem: locally Beautiful Soup was working, but on my Ubuntu server it was returning an empty list all the time. I tried many parsers following the link [1] and tried many dependencies.

Finally, what worked for me was:

  • remove the Beautiful Soup installation
  • remove all its dependencies (pulled in by `apt-get install python-bs4`)
  • install it again using the commands below

commands:

    sudo apt-get install python-bs4

    pip install beautifulsoup4

and I'm using the following code:

    soup = BeautifulSoup(my_html_content, 'html.parser')

  [1]: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser