There is this HTML:

<div>
    <div data-id="1"> </div>
    <div data-id="2"> </div>
    <div data-id="3"> </div>
    ...
    <div> </div> 
</div>

I need to select only the inner divs that have the attribute data-id (regardless of its value). How do I achieve that with Scrapy?

hydradon

4 Answers

You can use the following:

response.css('div[data-id]').extract()

This returns a list of all divs that have a data-id attribute:

[u'<div data-id="1"> </div>',
 u'<div data-id="2"> </div>',
 u'<div data-id="3"> </div>']
Zeeshan

Use BeautifulSoup. Code:

from bs4 import BeautifulSoup

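# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning.
soup = BeautifulSoup("""<div> <div data-id="1"> </div> <div data-id="2"> </div>  <div data-id="3"> </div><div> </div> </div>""", "html.parser")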

print(soup.find_all("div", {"data-id":True}))

OUTPUT:

[<div data-id="1"> </div>, <div data-id="2"> </div>, <div data-id="3"> </div>]

To require that an attribute be present, whatever its value, pass True as its value in the attribute filter of find or find_all.

bigbounty
  • Hi, I have read somewhere that BeautifulSoup is slow. The HTML I have is a huge webpage while the example I gave in my post is just a simplified version. Is there any way to do it with just Selenium? – hydradon Nov 04 '19 at 17:35
  • One more way I was thinking of is to loop through the sub-divs and check for the existence of that attribute: if it's there, process the div, otherwise skip it – bigbounty Nov 04 '19 at 17:42
  • BeautifulSoup is just a framework that uses a parser. Switch to a faster parser. Read - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use – bigbounty Nov 04 '19 at 17:43
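
The loop-and-check idea from the comments could look roughly like this. It is only a sketch using the standard library's html.parser rather than Scrapy or BeautifulSoup, and AttrCollector is a hypothetical name introduced for illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect the attributes of every <div> start tag that carries data-id."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        # Process the div only if the attribute exists; otherwise skip it.
        if tag == "div" and "data-id" in attr_map:
            self.matches.append(attr_map)


collector = AttrCollector()
collector.feed(
    '<div><div data-id="1"> </div><div data-id="2"> </div><div> </div></div>'
)
found_ids = [m["data-id"] for m in collector.matches]
print(found_ids)
```

This streams through the start tags once without building a tree, which may matter for a very large page, though a selector-based query is usually shorter to write.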
<li class="gb_i" aria-grabbed="false">
<a class="gb_d" data-pid="192" draggable="false" href="xyz.com" id="gb192">
<div data-class="gb_u"></div>
<div data-class="gb_v"></div>
<div data-class="gb_w"></div>
<div data-class="gb_x"></div>
</a>
</li>

Take a look at the example HTML above. To get all divs that have a data-class attribute in Scrapy v1.6+:

response.xpath('//a[@data-pid="192"]/div[@data-class]').getall()

In Scrapy versions before 1.6 you can use extract() in place of getall(). Hope this helps.
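
A presence-only XPath predicate has the form [@name]. The sketch below demonstrates it with the standard library's xml.etree.ElementTree as a stand-in for Scrapy (an assumption made so it runs without Scrapy installed; ElementTree supports only a subset of XPath). One div without data-class has been added to the sample markup to show that it is excluded.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the markup above, plus one div lacking data-class.
snippet = """
<li class="gb_i" aria-grabbed="false">
<a class="gb_d" data-pid="192" draggable="false" href="xyz.com" id="gb192">
<div data-class="gb_u"></div>
<div data-class="gb_v"></div>
<div></div>
</a>
</li>
""".strip()

root = ET.fromstring(snippet)

# [@data-class] keeps only divs on which the attribute exists.
with_attr = root.findall(".//a[@data-pid='192']/div[@data-class]")
all_divs = root.findall(".//a[@data-pid='192']/div")

classes = [div.get("data-class") for div in with_attr]
print(len(all_divs), classes)
```

Note that in XPath 1.0 contains(s, "") is true for every string, including the empty string an absent attribute converts to, so a contains(@data-class, "") test does not filter on presence.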

Adarsh Patel
 scrapy shell
In [1]: b = '''
   ...: <div>
   ...:     <div data-id="1">gdfg </div>
   ...:     <div data-id="2">dgdfg </div>
   ...:     <div data-id="3">asdasd </div>
   ...:     <div> </div>
   ...: </div>
   ...: '''
In [2]: from scrapy import Selector

In [3]: sel = Selector(text=b, type="html")

In [4]: sel.xpath(r'//div[re:test(@data-id,"\d")]/text()').extract()
Out[4]: ['gdfg ', 'dgdfg ', 'asdasd ']
Wertartem