How to extract text between <h1> and </h1> in Python?

Question

I'm stuck extracting text between <h1> and </h1>.

Please help me.

My code is:

import bs4
import re
import urllib2

url2='http://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=ch_vn_mobile_filter_Top%20Brands_All#jumpTo=0|20'
htmlf = urllib2.urlopen(url2)
soup = bs4.BeautifulSoup(htmlf)
#res=soup.findAll('div',attrs={'class':'product-unit'})
for res in soup.findAll('a',attrs={'class':'fk-display-block'}):
    suburl='http://www.flipkart.com/'+res.get('href')
    subhtml = urllib2.urlopen(suburl)
    subhtml = subhtml.read()
    subhtml = re.sub(r'\s\s+','',subhtml)
    subsoup=bs4.BeautifulSoup(subhtml)
    res2=subsoup.find('h1',attrs={'itemprop':'name'})
    if res2:
        print res2

The output:

<h1 itemprop="name">Moto G</h1>
<h1 itemprop="name">Moto E</h1>
<h1 itemprop="name">Moto E</h1>

But I want this:

Moto G
Moto E
Moto E

shaktimaan · Accepted Answer · 2014-08-26T03:19:48.343

5

Calling get_text() on any BeautifulSoup tag returns the text inside that tag. So you just need to call get_text() on res2, i.e.:

if res2:
    print res2.get_text()

PS: As a side note, I think the line subhtml = re.sub(r'\s\s+','',subhtml) in your code is an expensive operation, since it runs over the whole page. If all you are doing is getting rid of the excess whitespace, you could strip the extracted text instead:

if res2:
    print res2.get_text().strip()
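
As a minimal, self-contained sketch of the above: this version assumes Python 3 and a current bs4 (unlike the Python 2 code in the question), and the inline HTML string and variable names are made up for illustration rather than taken from the Flipkart page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one product page; not the real Flipkart markup.
html = '<html><body><h1 itemprop="name">\n   Moto G  </h1></body></html>'

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("h1", attrs={"itemprop": "name"})

if tag:
    raw_text = tag.get_text()            # text inside the tag, whitespace included
    clean_text = tag.get_text().strip()  # leading/trailing whitespace removed
    print(repr(raw_text))
    print(clean_text)
```

Note that strip() only trims the ends of the string; whitespace in the middle of the text is left alone.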

edited Aug 26 '14 at 03:19

answered Aug 26 '14 at 03:12

shaktimaan


You could also use `res2.text` instead of `res2.get_text()`. More info [here](https://stackoverflow.com/questions/35496332/differences-between-text-and-get-text). – J0ANMM Jun 05 '17 at 08:16

score 0 · Answer 2 · edited Jan 08 '21 at 08:55


You can try this:

res2 = subsoup.find('h1', attrs={'itemprop': 'name'})
if res2:
    print res2.text

Printing res2.text instead of res2 will do the trick.

edited Jan 08 '21 at 08:55

Ajay Lingayat


answered Jan 08 '21 at 03:39

Madushan Ranasinhe
