How to extract text written outside an h4 tag using Scrapy in Python

Question

The fields marked in blue are the ones I am trying to scrape.

<div class="txt-block">
   <h4 class="inline">Budget:</h4>
   "€650,000
                         "
   <span class="attribute">(estimated)</span>
</div>

I want to scrape the data that sits outside the h4 tag, i.e. €650,000. How can I do this with Scrapy CSS selectors in Python?

I tried the following, but it returns multiple fields:

item['Budget'] = response.css(".txt-block h4:not(span)::text").extract()

score 1 · Answer 1 · answered Jan 18 '19 at 19:52

Try using following-sibling::text() in your XPath, like this: response.xpath('//div[contains(@class, "txt-block")]/h4/following-sibling::text()').get() It returns the text node that follows the h4, which is the value you need.

vezunchik


There are multiple blank spaces and \n characters, so the output for a few of the tags is blank. How do I fetch the whole data, including the blanks and \n? – Dharmik Mehta Jan 18 '19 at 20:07
Strip them, like this: `[i.strip() for i in response.xpath('//div[contains(@class, "txt-block")]/h4/following-sibling::text()').extract() if i.strip()]` – vezunchik Jan 18 '19 at 20:11
This isn't working for me; it fetches all the data from the parent node. – Dharmik Mehta Jan 18 '19 at 20:24
Odd that it returns the parent node's data. I added a regexp to avoid the extra blanks: `response.xpath('//div[@class="txt-block"]/h4/following-sibling::text()').re(r'(.[\d\,]+)')`. As you can see from this expression, we take the h4 as the base and get the text that follows it. – vezunchik Jan 18 '19 at 20:31

SIM · Answer 2 · 2019-01-18T20:55:51.653

It seems you are looking for a real-life demo. Check out the following implementation:

import requests
from scrapy import Selector

url = "https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=702AB91P12YZ9Z98XH5T&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1"

res = requests.get(url)
# Build the selector from the HTML string of the requests response
sel = Selector(text=res.text)
budget = ' '.join(sel.css(".txt-block:contains('Budget')::text").extract()).strip()
gross = ' '.join(sel.css(".txt-block:contains('Gross USA')::text").extract()).strip()
cumulative = ' '.join(sel.css(".txt-block:contains('Cumulative Worldwide')::text").extract()).strip()
print(f'budget: {budget}\ngross: {gross}\ncumulative: {cumulative}')

Output at the time of writing:

budget: $25,000,000
gross: $28,341,469
cumulative: $58,500,000

score 0 · Answer 3 · answered Jan 18 '19 at 20:09

Try using:

data = [d.strip() for d in response.css('.txt-block::text').extract() if d.strip()]

The data you want is actually a direct text child of the div tag, so I select on that tag to get it.

ThunderMind


score 0 · Answer 4 · answered Jan 18 '19 at 20:43

You need to extract the text nodes into a list and pick the value at the desired index. Example:

import scrapy

# Sample markup based on the question
html_text = """
<div class="txt-block">
    <h4 class="inline">Budget:</h4>650,000
    <span class="attribute">(estimated)</span>
</div>
"""
# Build a selector from the raw HTML string
selector = scrapy.Selector(text=html_text)
print(selector)
# Select every text node under the div, including those of child tags
d = selector.xpath('//div[@class="txt-block"]//text()')
values = d.extract()  # Gives a list of text values
print(values)
# Index 0 is leading whitespace, index 1 is "Budget:", index 2 is the value you need
print(values[2].strip())

At the time of this answer, Scrapy selectors lacked the tag deletion that is available in BeautifulSoup, which is why the value is picked by index here.
