IMDb HTML Extraction - With Beautiful Soup

Question

With Beautiful Soup 4, I'm trying to get some text that doesn't seem to be inside a tag of its own. (I may be wrong; I'm not very experienced with HTML.)

I need to extract two values from the source of an IMDb page: the budget and the latest worldwide gross for a particular film. The length of the source varies between films, so a method using Beautiful Soup 4 that extracts these values regardless of line number would be hugely helpful. This is the code:

<div id="tn15content">
<h5>Budget</h5>
$165,000,000 (estimated)<br/>
<br/>

from the source code of this page: IMDb Box Office page for Interstellar

I need that '$165,000,000' to be extracted so I can store it etc.

The Gross code is even more confusing:

<h5>Gross</h5>
$188,020,017 (USA) (<a href="/date/03-19/">19 March</a> <a href="/year/2015/">2015</a>)<br/>$187,991,439 (USA) (<a href="/date/03-15/">15 March</a> <a href="/year/2015/">2015</a>)<br/>$187,930,551 (USA) (<a href="/date/03-14/">14 March</a> <a href="/year/2015/">2015</a>)<br/>$187,918,949 (USA) (<a href="/date/03-11/">11 March</a> <a href="/year/2015/">2015</a>)<br/>$187,888,097 (USA) (<a href="/date/03-08/">8 March</a> <a href="/year/2015/">2015</a>)<br/>

All I need from this is the most recent figure. (The Worldwide figures are further down, in a huge chunk of code that I left out to save space here.)

I know a similar problem was solved on here, but I couldn't get that solution to work, and being new to the site I couldn't comment to ask the answerer for help with my particular case. I was going to try IMDbPY, but I wasn't sure how to install it with WinPython.

score 0 · Answer 1 · answered Jul 30 '15 at 09:01

Use regular expressions, one for the USA gross and one for the worldwide gross:

\$([0-9,]+) \(USA\)

\$([0-9,]+) \(Worldwide\)

You can try the patterns out at http://pythex.org/
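As a minimal sketch of how these patterns might be applied in Python, assuming the page source is already available as a string (here the `html` variable just holds a shortened copy of the markup from the question). The helper name `first_amount` and the `\(estimated\)` budget pattern are illustrative assumptions, not part of the patterns above. It also assumes, as in the sample, that the most recent figure is listed first.

```python
import re

# Shortened sample of the markup from the question. In practice this string
# would be the downloaded page source.
html = """<div id="tn15content">
<h5>Budget</h5>
$165,000,000 (estimated)<br/>
<br/>
<h5>Gross</h5>
$188,020,017 (USA) (<a href="/date/03-19/">19 March</a> <a href="/year/2015/">2015</a>)<br/>$187,991,439 (USA) (<a href="/date/03-15/">15 March</a> <a href="/year/2015/">2015</a>)<br/>
"""


def first_amount(pattern, text):
    """Return the first captured amount for the pattern, or None if absent.

    Hypothetical helper: re.search returns the first match in the text,
    which is the latest figure only if the page lists newest entries first.
    """
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Assumed pattern for the budget line (not one of the patterns given above).
budget = first_amount(r"\$([0-9,]+) \(estimated\)", html)
usa_gross = first_amount(r"\$([0-9,]+) \(USA\)", html)
# The sample omits the Worldwide rows, so this is None here.
worldwide_gross = first_amount(r"\$([0-9,]+) \(Worldwide\)", html)

# Strip the thousands separators to store the value as a number.
budget_value = int(budget.replace(",", "")) if budget else None
```

Because the patterns work on the raw text, they do not depend on the line number or on the value having its own tag.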

Piotr
