Scrapy: Scraping all the text from a website but not the text of hyperlinks

Question

I have found helpful links explaining how to extract all the text from the body here: How can I get all the plain text from a website with Scrapy?

However, extracting all the text this way also picks up the text of hyperlinks, which I do not want. For example, take the page http://quotes.toscrape.com/tag/humor/page/1/

I used the following extractor:

text = re.sub(' +',' ',re.sub('\n|\t|\r','',' '.join(response.selector.xpath('//body/descendant-or-self::*[not( self::script | self::style)]/text()').extract()))).strip()

I got the output of:

"Quotes to Scrape Login Viewing tag: humor “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” by ...."

The word "Login" comes from the text of this hyperlink:

<a href="/login">Login</a>

Another example of a text coming from a hyperlink is:

<a href="#" data-toggle="tab" class="login-tab-links2 toplogin">KFN PUBlIC INVESTORS<small>K1 AND TAX INFO</small></a>

Here both 'KFN PUBlIC INVESTORS' and 'K1 AND TAX INFO' get scraped too.

How can I avoid scraping the text inside hyperlinks?

Thanks so much in advance!

score 1 · Accepted Answer · answered Aug 02 '17 at 06:28

You can check whether a node's parent or one of its ancestors is a node you don't want.

For example, this XPath finds all text nodes that are not direct children of an <a> node:

//text()[not(parent::a)]

Alternatively, you can use the ancestor axis, which checks whether any ancestor is an <a> node (that is, the parent, grandparent, great-grandparent and so on):

//text()[not(ancestor::a)]

Granitosaurus

Thanks for the quick reply! I tried two forms of the [not(parent::a)] idea: `'//body/descendant-or-self::*[not(self::script | self::style | ancestor::a)]/text()'` as well as `'//body/descendant-or-self::*[not(self::script | self::style)]/text()[not(ancestor::a)]'`, but neither worked. Lastly, I tried the simple form `'//body//text()[not(parent::a)]'` and that didn't work either. Is there anything I'm doing wrong (I tried both combinations, with parent and with ancestor)? – datumscientist Aug 02 '17 at 07:47
@BenedictLim what do you mean by "did not work"? Does it still capture text under the `<a>` node? Could you elaborate on what result you are expecting? – Granitosaurus Aug 02 '17 at 10:21
Actually, I tried `'//body/descendant-or-self::*[not(self::script | self::style | self::a)]/text()'`, replacing ancestor with self, and it worked for the most part! I think if I keep excluding elements in this way, I should be able to cut down the types of text I want to avoid. Thanks! – datumscientist Aug 02 '17 at 15:33
