1

I'm trying to extract the text from this HTML structure:

<div class="col-6 col-lg-3">
    <span class="font-weight-bold">List of Birds</span>
        <ul class="bird-forms">
            <li>Crow <span class="color">Black</span></li>
            <li>Peacock <span class="color">Multicolored</span></li>
            <li>Dove <span class="color">Multicolored</span></li>
            <li>Sparrow <span class="color">Brown</span></li>
            <li>Goose <span class="color">Multicolored</span></li>
            <li>Ostrich <span class="color">Multicolored</span></li>
        </ul>
</div>

In the Scrapy shell I ran: response.css('ul.bird-forms li ::text').extract()

I want the result to look like this:

['Crow Black', 
 'Peacock Multicolored',
 'Dove Multicolored', 
 'Sparrow Brown', 
 'Goose Multicolored',
 'Ostrich Multicolored']

Instead of this:

['Crow',
 'Black', 
 'Peacock',
 'Multicolored', 
 'Dove', 
 'Multicolored', 
 'Sparrow', 
 'Brown',
 'Goose', 
 'Multicolored',
 'Ostrich', 
 'Multicolored']
Dhaval Taunk
Goundo

3 Answers

2

Simply use XPath string():

birds = []
for li in response.xpath('//ul[@class="bird-forms"]/li'):
    bird = li.xpath('string(.)').get()
    birds.append(bird)
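
string(.) returns the concatenation of every descendant text node of the li, in document order, which is why each bird comes back as one string. As a rough way to check that without Scrapy, here is a sketch using the standard library's xml.etree.ElementTree as a stand-in (an assumption for illustration: it only works because this fragment happens to be well-formed XML, and it is not how Scrapy parses pages):

```python
import xml.etree.ElementTree as ET

# The question's markup, shortened to two items. Parsed as XML here only
# because this particular fragment is well-formed.
HTML = """
<div class="col-6 col-lg-3">
    <span class="font-weight-bold">List of Birds</span>
    <ul class="bird-forms">
        <li>Crow <span class="color">Black</span></li>
        <li>Peacock <span class="color">Multicolored</span></li>
    </ul>
</div>
"""

root = ET.fromstring(HTML)

# itertext() yields each descendant text node in document order; joining
# them with no separator mirrors what XPath string(.) returns.
birds = [
    "".join(li.itertext())
    for li in root.findall(".//ul[@class='bird-forms']/li")
]
print(birds)  # ['Crow Black', 'Peacock Multicolored']
```

Because "Crow " already ends with a space, joining with an empty separator gives exactly one space between the name and the color.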
gangabass
  • This is the best approach! The only issue is that the output list mixes single and double quotes. I even tried ``bird = li.xpath('string(.)').get().replace('""', "''")`` but it changed nothing. Other than that it's perfect. – Goundo Jun 09 '20 at 06:52
  • Hmm... I don't understand where you got the quotes. Could you show your output? – gangabass Jun 09 '20 at 16:13
  • It's not that important. I just got a list that looks like this: ``['Crow Black', "Peacock Multicolored", 'Dove Multicolored', 'Sparrow Brown', "Goose Multicolored", 'Ostrich Multicolored']``. Some elements have single quotes and some have double quotes, but it was working fine; Python doesn't complain about it. That was the best approach! – Goundo Jun 09 '20 at 23:17
1

You need to select the li tags first, and then select the text inside each li tag separately:

data = []
for li_tag in response.css("ul.bird-forms li"):
    data.append(" ".join(li_tag.css("*::text").extract()))

The same as a Python list comprehension:

data = [" ".join(x.css("*::text").extract()) for x in response.css("ul.bird-forms li")]

print(data)
# output <class 'list'>: ['Crow  Black', 'Peacock  Multicolored',
# 'Dove  Multicolored', 'Sparrow  Brown', 'Goose  Multicolored', 'Ostrich  Multicolored']
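
Note the two spaces in each item: the text node "Crow " already ends with a space, and " ".join adds another. If a single space is wanted, as in the question, one option is to collapse whitespace after joining. A minimal sketch, where collapse_whitespace is a hypothetical helper name and the sample list is typed in by hand rather than scraped:

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    # str.split() with no arguments splits on any whitespace run and
    # discards empty pieces, so re-joining leaves single spaces only.
    return " ".join(text.split())


# Hand-written sample of what the " ".join(...) approach produces.
joined = ["Crow  Black", "Peacock  Multicolored"]

cleaned = [collapse_whitespace(item) for item in joined]
print(cleaned)  # ['Crow Black', 'Peacock Multicolored']
```

In the loop above this would be applied to the joined string for each li_tag.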
Georgiy
0

We can pull the details separately and merge them afterwards:

li_tags = response.xpath(".//ul[@class='bird-forms']//li/text()").extract()
color_tags = response.xpath(".//ul[@class='bird-forms']//span[@class='color']/text()").extract()

[" ".join(entry) for entry in zip(li_tags, color_tags)]

['Crow  Black',
 'Peacock  Multicolored',
 'Dove  Multicolored',
 'Sparrow  Brown',
 'Goose  Multicolored',
 'Ostrich  Multicolored']
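
One thing that may be worth checking with this approach: zip pairs items purely by position and stops at the shorter list, so it relies on every li having exactly one color span. A small sketch with made-up lists (not taken from the page in the question) showing what happens if one item has no color:

```python
# Hypothetical extraction results: in this made-up case "Dove" has no
# color span, so the second list is one item shorter.
names = ["Crow ", "Dove ", "Sparrow "]
colors = ["Black", "Brown"]

# zip pairs by position, so "Brown" is attached to "Dove", and
# "Sparrow" is dropped because zip stops at the shorter list.
paired = [" ".join(entry) for entry in zip(names, colors)]
print(paired)  # ['Crow  Black', 'Dove  Brown']
```

For the markup in the question every li has a color, so the lists line up; selecting per li avoids the issue when that is not guaranteed.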
sammywemmy