Scrapy extracting with span inside

Question

I'm trying to extract the text from this HTML structure:

<div class="col-6 col-lg-3">
    <span class="font-weight-bold">List of Birds</span>
        <ul class="bird-forms">
            <li>Crow <span class="color">Black</span></li>
            <li>Peacock <span class="color">Multicolored</span></li>
            <li>Dove <span class="color">Multicolored</span></li>
            <li>Sparrow <span class="color">Brown</span></li>
            <li>Goose <span class="color">Multicolored</span></li>
            <li>Ostrich <span class="color">Multicolored</span></li>
        </ul>
</div>

Using scrapy shell: response.css('ul.bird-forms li ::text').extract()

I want the result to look like this:

['Crow Black', 
 'Peacock Multicolored',
 'Dove Multicolored', 
 'Sparrow Brown', 
 'Goose Multicolored',
 'Ostrich Multicolored']

Instead of this:

['Crow',
 'Black', 
 'Peacock',
 'Multicolored', 
 'Dove', 
 'Multicolored', 
 'Sparrow', 
 'Brown',
 'Goose', 
 'Multicolored',
 'Ostrich', 
 'Multicolored']

score 2 · Accepted Answer · answered Jun 07 '20 at 09:13


Simply use XPath string():

birds = []
for li in response.xpath('//ul[@class="bird-forms"]/li'):
    bird = li.xpath('string(.)').get()
    birds.append(bird)
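
For context, XPath string() returns the string-value of a node, which is the concatenation of all text nodes beneath it in document order. That is why calling it once per li gives one string per bird. The sketch below reproduces the idea with only the standard library so it can be run without Scrapy; `xml.etree.ElementTree` is a stand-in chosen for illustration (it works here only because the snippet is well-formed XML), and the markup is trimmed to two items from the question.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, trimmed to two items.
html = """
<ul class="bird-forms">
    <li>Crow <span class="color">Black</span></li>
    <li>Sparrow <span class="color">Brown</span></li>
</ul>
"""

root = ET.fromstring(html.strip())

# itertext() yields every text node under the element in document order,
# which mirrors what XPath string(.) concatenates for each <li>.
birds = ["".join(li.itertext()) for li in root.iter("li")]
print(birds)  # ['Crow Black', 'Sparrow Brown']
```

In a Scrapy shell the equivalent per-item call is the `li.xpath('string(.)').get()` line from the answer above.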

gangabass


This is the best approach! The only issue is that the output list mixes single-quoted and double-quoted strings. I even tried running replace with ``bird = li.xpath('string(.)').get().replace('""', "''") `` but it didn't change anything. Other than that it's perfect. – Goundo Jun 09 '20 at 06:52
Hmm... I don't understand where you got the quotes from. Could you show your output? – gangabass Jun 09 '20 at 16:13
It's not that important. I just got a list that looks like this: ``['Crow Black', "Peacock Multicolored", 'Dove Multicolored', 'Sparrow Brown', "Goose Multicolored", 'Ostrich Multicolored']``. Some elements have single quotes and some have double quotes, but it works fine and Python doesn't complain about it. That was the best approach! – Goundo Jun 09 '20 at 23:17
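
A note on the mixed quotes: one possible explanation, which cannot be confirmed from the output pasted in the comment, is that some of the real strings contained an apostrophe. When Python prints a list it shows each string's repr, and the repr switches to double quotes when the string contains a single quote and no double quote. The delimiters are not part of the data, so replace() has nothing to change. A minimal sketch, with a made-up value (`with_apostrophe`) that is not from the page:

```python
plain = "Crow Black"
with_apostrophe = "Steller's Jay Blue"  # hypothetical value, for illustration only

# Printing a list shows repr() of each element; the quote style is chosen
# per string and is not stored in the string itself.
print([plain, with_apostrophe])
# ['Crow Black', "Steller's Jay Blue"]

# Neither kind of quote is a character inside the plain string.
print("'" in plain, '"' in plain)  # False False
```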

Georgiy · Answer 2 · 2020-06-06T13:58:04.383

You need to select the li tags first, then select the text within each li tag separately:

data = []
for li_tag in response.css("ul.bird-forms li"):
    data.append(" ".join(li_tag.css("*::text").extract()))

The same as a Python list comprehension:

data = [" ".join(x.css("*::text").extract()) for x in response.css("ul.bird-forms li")]

print(data)
# output <class 'list'>: ['Crow  Black', 'Peacock  Multicolored',
# 'Dove  Multicolored', 'Sparrow  Brown', 'Goose  Multicolored', 'Ostrich  Multicolored']
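
The output above has two spaces between the name and the color, because the first text node already ends with a space ('Crow ') and the join adds another. If single spaces are wanted, one option is to collapse whitespace after joining. A minimal sketch in plain Python, where the `raw` values are hypothetical examples of joined text rather than output captured from the page:

```python
# Hypothetical raw values, as produced by joining text nodes with " ".
raw = ["Crow  Black", " Peacock \n Multicolored "]

# str.split() with no arguments splits on any run of whitespace and drops
# leading/trailing whitespace, so re-joining yields single spaces.
clean = [" ".join(item.split()) for item in raw]
print(clean)  # ['Crow Black', 'Peacock Multicolored']
```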

sammywemmy · Answer 3 · 2020-06-06T13:52:15.180

0

We can pull the details separately and merge them afterwards:

li_tags = response.xpath(".//ul[@class='bird-forms']//li/text()").extract()
color_tags = response.xpath(".//ul[@class='bird-forms']//span[@class='color']/text()").extract()


[" ".join(entry) for entry in zip(li_tags, color_tags)]

['Crow  Black',
 'Peacock  Multicolored',
 'Dove  Multicolored',
 'Sparrow  Brown',
 'Goose  Multicolored',
 'Ostrich  Multicolored']
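
One caveat with pairing two separately extracted lists: zip() matches items by position, so it relies on every li having exactly one color span. The sketch below uses hypothetical lists (not output from the page) in which the second bird has no color, to show how the pairing shifts and the last name is silently dropped:

```python
# Hypothetical extraction results: "Peacock" has no color span,
# so the colors list is one item shorter than the names list.
names = ["Crow ", "Peacock ", "Dove "]
colors = ["Black", "Multicolored"]  # "Multicolored" actually belongs to Dove

# zip() pairs by position and stops at the shorter list.
paired = [" ".join(entry) for entry in zip(names, colors)]
print(paired)  # ['Crow  Black', 'Peacock  Multicolored']
```

Selecting each li first and reading its text, as in the other answers, avoids this because name and color come from the same element.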

edited Jun 06 '20 at 13:52

answered Jun 06 '20 at 12:56


I am already getting the same result as yours. What I want is ``['Crow Black', 'Peacock Multicolored', ...]`` instead of ``['Crow ', 'Black', ...]``. – Goundo Jun 06 '20 at 13:24
Oh I misunderstood. My bad – sammywemmy Jun 06 '20 at 13:27
