
I have recently found a very neat way of web scraping using bs4 that has a really nice, organized structure to it. Let's say this is our HTML:

<div class="a">
  <div class="b">
    <a href="www.yelloaes.com">'hi'</a>
  </div>
  <div class="c">
    <p><a href="www.bb.com">'hi again'</a></p>
    <div class="d">
      <p>'well this final'</p>
    </div>
  </div>
</div>


<div class="a">
  <div class="b">
    <a href="www.yelloaes1.com">'hi1'</a>
  </div>
  <div class="c">
    <p><a href="www.bb1.com">'hi again1'</a></p>
    <div class="d">
      <p>'well this final1'</p>
    </div>
  </div>
</div>

Now I am assuming <div class="a"> is our parent tag and we will pull info out of it, which means I have to loop through every such tag to extract info from the whole page.

But because I was having a hard time understanding BeautifulSoup, I did a test run with some Python code to extract the info from just the first <div class="a">.

My code is like this:

soup = BeautifulSoup(r.text)
find_hi = soup.find('div', {'class': 'a'}).div.text
find_hi-again = soup.find('div', {'class': 'a'}).find_all('div')[1].p.text
find_final = soup.find('div', {'class': 'a'}).find('div', {'class': 'd'}).text

print(find_hi, find_hi-again, find_final)

# output comes as (it worked!!!)
hi, hi again, well this final

Note: I really want to stick with this approach, so please no completely new ways of scraping. Now I can't seem to loop over the whole page. I tried this for looping, but it does not show the result I want to see:

soup  = BeautifulSoup(r.text)
#To have a list of all div tags having this class
scraping  = soup.find_all('div',{'class':'a'})
for i in scraping:
    find_hi =      i.div.text
    find_hi-again =i.find_all('div')[1].p.text
    find_final    =i.find('div',{'class':'d'}).text

print(find_hi , find_hi-again , find_final)

Please help with the looping?

Anurag Pandey
  • What is the result that is shown? – sushant Jul 30 '16 at 08:22
  • It is showing a result, but not the different elements; it shows repeated elements from the same tag, like "hi, hi again, well this final, hi, hi again, well this final" instead of "hi, hi again, well this final, hi1, hi again1, well this final1". – Anurag Pandey Jul 30 '16 at 09:02
  • Share the url if possible and what you expect as output; your current code makes little sense. – Padraic Cunningham Jul 30 '16 at 09:53
  • I want to print the contents from expedia.com like hotel name, ratings etc. When I inspected the website I found a pattern of div tags with a unique class for each hotel listed. I used find, not find_all, to print the desired things from the first div tag; my question is simply how to loop over all the hotels. – Anurag Pandey Jul 30 '16 at 10:28
  • From memory, expedia is heavily reliant on JavaScript, so you won't get the source using anything other than something that can run JavaScript. – Padraic Cunningham Jul 30 '16 at 18:06
  • Yeah @PadraicCunningham, in the end I did not find what I was looking for on expedia.com, but I did manage to scrape coursera.com and udacity.com. – Anurag Pandey Jul 31 '16 at 06:39

1 Answer


Your code works fine for me, except for the syntax error: find_hi-again is not a valid variable name.

divs = soup.find_all('div', {'class': 'a'})
for i in divs:
    find_hi = i.div.text.strip()
    find_hi_again = i.find_all('div')[1].p.text.strip()
    find_final = i.find('div', {'class': 'd'}).text.strip()

    print(find_hi, find_hi_again, find_final)

## (u"'hi'", u"'hi again'", u"'well this final'")
## (u"'hi1'", u"'hi again1'", u"'well this final1'")
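Note that the question's version also had its print() outside the for block, so at most the last iteration's values would ever be shown; printing inside the loop gives one line per block. A minimal sketch of that control-flow fix, using a hypothetical hard-coded list in place of soup.find_all so it runs without bs4 installed:

```python
# Stand-in for the two parsed <div class="a"> blocks from the sample HTML
# (hard-coded tuples, purely illustrative).
blocks = [
    ("hi", "hi again", "well this final"),
    ("hi1", "hi again1", "well this final1"),
]

for find_hi, find_hi_again, find_final in blocks:
    # print INSIDE the loop: one line per <div class="a">
    print(find_hi, find_hi_again, find_final)

# A print placed here, after the loop, would show only the last block's
# values, because the loop variables keep their final assignment.
```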
Julien Spronck
  • Please try it on a large real page and tell me if it works, because it is not working for me. And yeah, I made that name up; I know it throws an error. – Anurag Pandey Jul 30 '16 at 09:46
  • Thanks for the answer, I checked and it works. I guess it just does not work on some sites; I tried expedia and makemytrip and it was not working, but on other sites it worked. – Anurag Pandey Jul 30 '16 at 11:38