I found a helpful link explaining how to extract all the text from the body: How can I get all the plain text from a website with Scrapy?
However, extracting all the text this way also picks up the text of hyperlinks, which I do not want. For example, take this page: http://quotes.toscrape.com/tag/humor/page/1/
I used the following extractor:
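import re

# Join every text node under <body> (skipping <script> and <style>), then collapse whitespace
raw = ' '.join(response.selector.xpath('//body/descendant-or-self::*[not(self::script | self::style)]/text()').extract())
text = re.sub(' +', ' ', re.sub(r'\n|\t|\r', '', raw)).strip()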
I got the following output:
"Quotes to Scrape Login Viewing tag: humor “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” by ...."
The word "Login" comes from the text of this hyperlink:
<a href="/login">Login</a>
Another example of text coming from a hyperlink is:
<a href="#" data-toggle="tab" class="login-tab-links2 toplogin">KFN PUBlIC INVESTORS<small>K1 AND TAX INFO</small></a></li>
Here, 'KFN PUBlIC INVESTORS' and 'K1 AND TAX INFO' get scraped too, even though the second string sits in a <small> element nested inside the link.
How can I extract the body text without also scraping the text inside hyperlinks?
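To make the desired behavior concrete, here is a minimal sketch of the filtering I am after. It is not Scrapy code: it uses Python's standard-library html.parser instead of XPath, and the sample HTML and the names SKIPPED_TAGS, LinkFreeTextExtractor, and extract_text_without_links are made up for illustration. In XPath terms, this would presumably correspond to adding ancestor-or-self::a to the not(...) test in the extractor above, but I have not verified that against the page.

```python
from html.parser import HTMLParser
import re

# Tags whose text should be dropped; "a" is the addition over the original filter.
SKIPPED_TAGS = {"script", "style", "a"}


class LinkFreeTextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside any skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when no skipped tag is open, so text in elements
        # nested inside a link (such as <small>) is dropped as well.
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text_without_links(html):
    parser = LinkFreeTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


# Hand-written sample loosely modeled on the snippets above (not the real page).
sample = (
    '<body><h1>Quotes to Scrape</h1><p><a href="/login">Login</a></p>'
    '<h3>Viewing tag: humor</h3>'
    '<li><a href="#" class="toplogin">KFN PUBlIC INVESTORS'
    '<small>K1 AND TAX INFO</small></a></li></body>'
)
print(extract_text_without_links(sample))
```

With this sample, neither "Login" nor the two strings inside the second link appear in the result, which is the behavior I would like to get from the Scrapy selector.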
Thanks so much in advance!