Scrapy select HTML elements that have specific attribute name

Question

There is this HTML:

<div>
    <div data-id="1"> </div>
    <div data-id="2"> </div>
    <div data-id="3"> </div>
    ...
    <div> </div> 
</div>

I need to select only the inner divs that have the data-id attribute (regardless of its value). How do I achieve that with Scrapy?

With Scrapy, can't you use `response.css('div[data-id]')`? – Zeeshan Nov 04 '19 at 17:58

score 3 · Accepted Answer · answered Nov 04 '19 at 18:05

You can use the following:

response.css('div[data-id]').extract()

It returns a list of all divs that have a data-id attribute:

[u'<div data-id="1"> </div>',
 u'<div data-id="2"> </div>',
 u'<div data-id="3"> </div>']

Zeeshan

score 0 · Answer 2 · answered Nov 04 '19 at 17:17

You can use BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""<div> <div data-id="1"> </div> <div data-id="2"> </div>  <div data-id="3"> </div><div> </div> </div>""", "html.parser")

print(soup.find_all("div", {"data-id":True}))

OUTPUT:

[<div data-id="1"> </div>, <div data-id="2"> </div>, <div data-id="3"> </div>]

You can require an attribute to be present in find or find_all by passing True as its value.
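As a rough sketch of the same idea, the attribute filter can also be written as a CSS selector with select(). This assumes bs4 is installed. It uses the built-in "html.parser" so nothing else is required, but another installed parser such as "lxml" could be passed instead.

```python
from bs4 import BeautifulSoup

html = """<div> <div data-id="1"> </div> <div data-id="2"> </div> <div data-id="3"> </div><div> </div> </div>"""

# Naming the parser explicitly avoids bs4 guessing one for you.
soup = BeautifulSoup(html, "html.parser")

# True as the attribute value means "the attribute must exist".
with_attr = soup.find_all("div", attrs={"data-id": True})

# The same filter expressed as a CSS attribute selector.
via_css = soup.select("div[data-id]")

# Read the attribute values from the matched tags.
ids = [div["data-id"] for div in with_attr]
print(ids)
```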

bigbounty

Hi, I have read somewhere that BeautifulSoup is slow. The HTML I have is a huge webpage while the example I gave in my post is just a simplified version. Is there any way to do it with just Selenium? – hydradon Nov 04 '19 at 17:35
Another way I was thinking of is to loop through the sub-divs and check for the existence of that attribute: if it's there, process the div, otherwise skip it. – bigbounty Nov 04 '19 at 17:42
BeautifulSoup is just a framework that uses a parser. Switch to a faster parser. Read - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use – bigbounty Nov 04 '19 at 17:43

Adarsh Patel · Answer 3 · 2019-11-04T18:09:10.430

<li class="gb_i" aria-grabbed="false">
<a class="gb_d" data-pid="192" draggable="false" href="xyz.com" id="gb192">
<div data-class="gb_u"></div>
<div data-class="gb_v"></div>
<div data-class="gb_w"></div>
<div data-class="gb_x"></div>
</a>
</li>

Take a look at the example HTML above. To get all divs that have a data-class attribute in Scrapy v1.6+:

response.xpath('//a[@data-pid="192"]/div[@data-class]').getall()

In Scrapy versions before 1.6, you can use extract() in place of getall(). Hope this helps.

score 0 · Answer 4 · answered Nov 04 '19 at 20:12

 scrapy shell
In [1]: b = '''
   ...: <div>
   ...:     <div data-id="1">gdfg </div>
   ...:     <div data-id="2">dgdfg </div>
   ...:     <div data-id="3">asdasd </div>
   ...:     <div> </div>
   ...: </div>
   ...: '''
In [2]: from scrapy import Selector

In [3]: sel = Selector(text=b, type="html")

In [4]: sel.xpath(r'//div[re:test(@data-id,"\d")]/text()').extract()
Out[4]: ['gdfg ', 'dgdfg ', 'asdasd ']
