    import urllib
    import urlparse
    import mechanize
    from bs4 import BeautifulSoup

    url = "http://www.wholefoodsmarket.com/forums"
    br = mechanize.Browser()
    urls = [url]        # queue of pages still to crawl
    visited = [url]     # pages already queued, so we do not crawl them twice

    while len(urls) > 0:
        current = urls.pop(0)
        try:
            br.open(current)
            for link in br.links():
                newurl = urlparse.urljoin(link.base_url, link.url)
                parsed = urlparse.urlparse(newurl)
                if parsed.hostname is None:
                    continue    # skip mailto:, javascript: and similar links
                # Normalise to scheme + host + path so query strings and
                # fragments do not create duplicate entries
                newurl = "http://" + parsed.hostname + parsed.path
                # Only follow links that stay on the starting host
                if newurl not in visited and urlparse.urlparse(url).hostname in newurl:
                    urls.append(newurl)
                    visited.append(newurl)
                    ur = urllib.urlopen(newurl)
                    soup = BeautifulSoup(ur.read())
                    html = soup.find_all()   # debug: dump every tag found on the page
                    print html
                    f = open('content.txt', 'a')
                    f.write(newurl + "\n")
                    if soup.title and soup.title.string:
                        f.write(soup.title.string.encode('utf-8') + "\n")
                    f.write(str(soup.head) + "\n")
                    f.write(str(soup.body) + "\n")
                    f.write("Next Link\n")
                    f.close()
        except Exception as e:
            print "error on %s: %s" % (current, e)

I am trying to recursively crawl HTML pages (up to about 1 GB of data) and then extract the relevant text, i.e. discard all code and HTML tags. Can someone suggest a link I can follow?

Biparite
  • So, what do you want to do with the data? How you go about this will depend very much on the form you would like your data to be in – Nick Bailey Feb 19 '15 at 04:30
  • Also, avoid print >>f; use f.write() instead. – Nick Bailey Feb 19 '15 at 04:31
  • I need the forum discussions in text form, appended to a content file in the format [URL, Title, Text], for all child pages of www.wholefoodsmarket.com (a sketch of that format follows these comments). The idea is to collect enough data and then use it to build a search engine. I have chosen the category Food; my other 200 classmates have different categories. – Biparite Feb 19 '15 at 04:47
  • @Biparite does the proposed solution work for you? – aberna Feb 21 '15 at 09:15
  • I have posted a link to another question below in the comments. The answer given to that question works for me: it strips out all the JavaScript and tags and leaves me with the required text. Thanks for your efforts. – Biparite Feb 22 '15 at 00:23
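
For reference, a minimal sketch of appending one record in that [URL, Title, Text] layout per page, in the same Python 2 / BeautifulSoup style as the question; the helper name append_record and the use of get_text() are assumptions, not code from the original post:

    import urllib
    from bs4 import BeautifulSoup

    def append_record(page_url, outfile='content.txt'):
        # Fetch and parse the page (Python 2, as in the question's crawler).
        soup = BeautifulSoup(urllib.urlopen(page_url).read())
        title = soup.title.string if soup.title and soup.title.string else ''
        # get_text() flattens the page to plain text; see the answers below
        # for stripping <script>/<style> content out of it first.
        text = soup.get_text(separator=' ', strip=True)
        with open(outfile, 'a') as f:
            f.write(page_url + "\n")
            f.write(title.encode('utf-8') + "\n")
            f.write(text.encode('utf-8') + "\n")

Calling append_record(newurl) inside the crawl loop above would replace the block that currently writes the raw head and body markup.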

2 Answers


You could try using the get_text method.

Relevant code snippet:

    soup = BeautifulSoup(html_doc)
    print(soup.get_text())

Hope it gets you started in the right direction.

Vivek Pradhan
  • Yeah, I looked at that method. Though it gives me the text, it also has a bunch of code and tags glued to it, and I don't want that. – Biparite Feb 19 '15 at 05:57
  • http://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript (the approach from that question is sketched below) – Biparite Feb 19 '15 at 22:18
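
A minimal sketch of the approach from that linked question, assuming BeautifulSoup 4: remove the <script> and <style> elements before calling get_text(), so the returned text no longer has JavaScript or CSS glued to it (the helper name visible_text is illustrative):

    from bs4 import BeautifulSoup

    def visible_text(html_doc):
        soup = BeautifulSoup(html_doc)
        # Drop elements whose contents are code rather than prose.
        for tag in soup(['script', 'style']):
            tag.decompose()
        # get_text() now returns only the human-readable text.
        return soup.get_text(separator=' ', strip=True)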

In case you are not limited to BeautifulSoup, I would suggest exploring XPath capabilities.

As an example, to get all the text from a page you would need an expression as simple as this one:

    //*/text()

The text from all links will be:

    //a/text()

Similar expressions can be used to extract all the info you need. More info on XPath here: https://stackoverflow.com/tags/xpath/info
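
For instance, with lxml (one common library with XPath support; the answer does not name a specific tool, so its use here is an assumption) the two expressions above can be applied like this:

    import urllib
    from lxml import html

    page = html.fromstring(urllib.urlopen("http://www.wholefoodsmarket.com/forums").read())
    all_text = page.xpath('//*/text()')    # every text node on the page
    link_text = page.xpath('//a/text()')   # text of every <a> element
    print(' '.join(t.strip() for t in all_text if t.strip()))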

If you have problems building the crawler from scratch, think about using an already implemented one (such as Scrapy); a sketch follows.
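
A minimal sketch of such a Scrapy spider, assuming you want one [URL, Title, Text] record per page on the forum host (the spider name and output fields are illustrative, not from the original post):

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ForumSpider(CrawlSpider):
        name = 'wholefoods_forums'
        allowed_domains = ['wholefoodsmarket.com']
        start_urls = ['http://www.wholefoodsmarket.com/forums']

        # Follow every in-domain link and hand each fetched page to parse_page.
        rules = (Rule(LinkExtractor(), callback='parse_page', follow=True),)

        def parse_page(self, response):
            yield {
                'url': response.url,
                'title': response.xpath('//title/text()').extract_first(),
                # All text nodes inside <body>, joined into one string.
                'text': ' '.join(response.xpath('//body//text()').extract()),
            }

Something like scrapy runspider forum_spider.py -o content.csv would then collect the records into a file.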

aberna