
I often use Beautiful Soup to parse HTML files, so when I recently needed to parse an XML file I chose to use it as well. However, because the file I'm parsing is extremely large, it failed. While researching why it failed, I was led to this question: Loading huge XML files and dealing with MemoryError.

This leads me to my question: if lxml can handle large files and Beautiful Soup cannot, is there any benefit to using Beautiful Soup rather than simply using lxml directly?
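For reference, the kind of streaming approach that gets suggested for this looks roughly like the sketch below (the filename and the "record"/"title" tags are just placeholders for my actual data):

    from lxml import etree

    # Stream the document element by element instead of building the whole
    # tree in memory; "huge.xml" and the "record" tag are placeholders.
    for _, elem in etree.iterparse("huge.xml", events=("end",), tag="record"):
        print(elem.findtext("title"))  # per-element work goes here
        # Release the element (and already-processed siblings) so memory stays flat.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]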

Jake Sebright

2 Answers


If you look at this link about BeautifulSoup Parser:

"BeautifulSoup" is a Python package that parses broken HTML, while "lxml" does so faster but with high quality HTML/XML. So if you're dealing with the first one you're better off with BS... but the advantage of having "lxml" is that you're able to get the soupparser.

The link I provided at the top shows how you can use the capabilities of BS together with lxml.
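A rough sketch of that combination (it requires BeautifulSoup to be installed alongside lxml; the broken markup here is just an illustration):

    from lxml.html import soupparser

    # Markup that a strict parser would likely reject or mangle.
    broken = "<p>Unclosed paragraph <b>bold text</div>"

    # soupparser uses BeautifulSoup to build the tree, then hands back a
    # normal lxml element, so XPath and the rest of lxml's API still work.
    root = soupparser.fromstring(broken)
    print(root.xpath("//b/text()"))  # typically ['bold text']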

So in the end... you are better off with "lxml".

Leb

lxml is very fast and relatively memory efficient. BeautifulSoup by itself scores less well on the efficiency front, but it is built to handle non-standard / broken HTML and XML, which makes it more versatile.

Which you choose really depends on your use case -- web scraping? Probably BS. Parsing machine-written, structured metadata? lxml is a great choice.

There is also the learning curve to consider when making the switch: the two libraries implement search and navigation in slightly different ways, enough to make learning one after starting with the other a non-trivial task.
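For a feel of that difference, here is the same lookup in both libraries (a small sketch; the XML snippet is made up):

    from io import BytesIO
    from bs4 import BeautifulSoup
    from lxml import etree

    xml = b"<books><book><title>Dune</title></book></books>"

    # BeautifulSoup style: navigate by tag name with find / find_all.
    soup = BeautifulSoup(xml, "xml")
    print(soup.find("book").title.string)

    # lxml style: query the tree with XPath (or findall / ElementPath).
    tree = etree.parse(BytesIO(xml))
    print(tree.xpath("//book/title/text()")[0])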

Walker