I have a piece of code that parses web pages. I want to remove all content between <div>, <a href> and <h1> tags.

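import urllib2
from bs4 import BeautifulSoup  # assumed import; the original snippet did not show one

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = "http://en.wikipedia.org/wiki/Viscosity"
try:
  ourUrl = opener.open(url).read()
except Exception, err:
  # Stop here instead of passing silently; otherwise ourUrl is undefined below.
  raise SystemExit("Could not fetch %s: %s" % (url, err))
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p')

for i in dem:
  print i.text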

I want to print the text without any of the content between the <h1>, <a href> and <div> tags mentioned above.

user2707082

1 Answer

EDIT: From the comment "I want to return text which is not between any <div> and </div> tags.". This should skip any text that has a div tag among its ancestors:

raw = '''
<html>
Text <div> Avoid this </div>
<p> Nested <div> Don't get me either </div> </p>
</html>
'''

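import bs4

def check_for_div_parent(mark):
    # Walk up the tree; return True if any ancestor is a <div>.
    mark = mark.parent
    # Stop at <html>, or at the document root, whose parent is None.
    if mark is None or 'html' == mark.name:
        return False
    if 'div' == mark.name:
        return True
    return check_for_div_parent(mark)

soup = bs4.BeautifulSoup(raw)

for text in soup.findAll(text=True):
    # Skip whitespace-only nodes and anything nested inside a <div>.
    if text.strip() and not check_for_div_parent(text):
        print text.strip()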

This prints only the two pieces of text that are outside the div tags:

Text
Nested
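
For readers without bs4, the same idea can be sketched in Python 3 with only the standard library. This is a swapped-in technique, not the approach above: instead of walking up each text node's parents, it counts how many <div> elements are currently open while parsing. The names TextOutsideDivs and text_outside_divs are hypothetical, and the sketch assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser


class TextOutsideDivs(HTMLParser):
    """Collect text nodes that are not nested inside any <div>."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # number of <div> elements currently open
        self.chunks = []    # stripped text found outside every <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.div_depth += 1

    def handle_endtag(self, tag):
        # Guard against stray </div> tags so the depth never goes negative.
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when no <div> is open.
        if text and self.div_depth == 0:
            self.chunks.append(text)


def text_outside_divs(html):
    """Return the list of text chunks that sit outside all <div> tags."""
    parser = TextOutsideDivs()
    parser.feed(html)
    parser.close()
    return parser.chunks


raw = """
<html>
Text <div> Avoid this </div>
<p> Nested <div> Don't get me either </div> </p>
</html>
"""

for chunk in text_outside_divs(raw):
    print(chunk)
```

Because it only tracks a depth counter, this version also skips text inside nested divs, but it will not cope as gracefully with broken markup as a tree-building parser would.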

Original response

It's unclear what exactly you are trying to do. First, you should try to post a full working example, as you seem to be missing your imports. Second, Wikipedia seems to have a stance against "bots" or automated downloaders:

Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

The 403 error can be avoided by sending a User-Agent header with the request:

import urllib2, bs4

url = r"http://en.wikipedia.org/wiki/Viscosity"

req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"}) 
con = urllib2.urlopen( req )
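
In Python 3, urllib2 was merged into urllib.request, so a rough equivalent of the request above might look like the sketch below. The helper name fetch is hypothetical, and the network call is left commented out so the snippet can be read and run without hitting Wikipedia.

```python
import urllib.request

url = "http://en.wikipedia.org/wiki/Viscosity"

# Build a request that carries a custom User-Agent header.
req = urllib.request.Request(url, headers={"User-Agent": "Magic Browser"})


def fetch(request):
    """Open the request and return the raw response body as bytes."""
    with urllib.request.urlopen(request) as con:
        return con.read()


# html = fetch(req)  # uncomment to perform the actual download
```

The returned bytes can then be handed to the parser in the same way as con.read() in the Python 2 code.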

Now that we have the page, I think you just want to extract the main text using bs4. I would do something like this:

soup = bs4.BeautifulSoup(con.read())
start_pos = soup.find('h1').parent

for p in start_pos.findAll('p'):
    para = ''.join([text for text in p.findAll(text=True)])
    print para

This gives me text that looks like:

The viscosity of a fluid is a measure of its resistance to gradual deformation by shear stress or tensile stress. For liquids, it corresponds to the informal notion of "thickness". For example, honey has a higher viscosity than water.[1] Viscosity is due to friction between neighboring parcels of the fluid that are moving at different velocities. When fluid is forced through a tube, the fluid generally moves faster near the axis and very slowly near the walls, therefore some stress (such as a pressure difference between the two ends of the tube) is needed to overcome the friction between layers and keep the fluid moving. For the same velocity pattern, the stress required is proportional to the fluid's viscosity. A liquid's viscosity depends on the size and shape of its particles and the attractions between the particles.[citation needed]

Hooked
  • Ok, the link can be any website. I want to return text which is not between any <div> and </div> tags. Say "Viscosity is due to friction between neighboring parcels of the fluid that are moving at different velocities." was between div tags.
    – user2707082 Aug 26 '13 at 14:55
  • @user2707082 I've updated the answer based off your response. – Hooked Aug 26 '13 at 15:12
  • Hi, your code is fine, but I have to check each paragraph and it rejects all of them because they all have a <div>. Can you modify the code to just convert <div> and </div> to comments?
    – user2707082 Aug 27 '13 at 10:37