I tried to use a selector in the Scrapy shell to extract information from a web page, but it didn't work properly. I believe this happened because there is whitespace in the class name. Any idea what's going wrong?

I tried different syntaxes like:

response.xpath('//p[@class="text-nnowrap hidden-xs"]').getall()

response.xpath('//p[@class="text-nnowrap hidden-xs"]/text()').get()

# what I type into my scrapy shell
response.css('div.offer-item-details').xpath('//p[@class="text-nowrap hidden-xs"]/text()').get()

# html code that I need to extract:
<p class="text-nowrap hidden-xs">Apartamento para arrendar: Olivais, Lisboa</p>

expected result: Apartamento para arrendar: Olivais, Lisboa

actual result: []

  • There isn't really whitespace in the class name. In HTML you can give an element multiple classes by separating them with whitespace in the class attribute. This means the <p> has two classes: text-nowrap and hidden-xs. That might help you debug the problem further. A quick search led me to the following solution, which I haven't tested myself: https://stackoverflow.com/a/3881148/6511985 – Stephan Schrijver May 16 '19 at 17:20
  • First check whether the page uses JavaScript to add elements to the HTML. Scrapy can't run JavaScript, so you may be getting different HTML than you expect. – furas May 16 '19 at 17:20
  • Thanks @StephanSchrijver for your help. That's the point: the class name doesn't contain whitespace. Now I need to know how to use the 'response.css()' selector to match an element whose class attribute lists several classes. I'll do my research on it. Thanks! – Elsior Moreira Alves Junior May 18 '19 at 07:55

2 Answers

The whitespace in the class attribute means that the element has multiple classes: the "text-nowrap" class and the "hidden-xs" class. To select an element by multiple classes with XPath, you can use the following format:

"//element[contains(@class, 'class1') and contains(@class, 'class2')]"

(grabbed this from How to get html elements with multiple css classes)

So in your example, I believe this would work.

response.xpath("//p[contains(@class, 'text-nowrap') and contains(@class, 'hidden-xs')]").getall()
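
To see why the `contains()` form is more forgiving than an exact `@class="..."` comparison, here is a minimal sketch in plain Python (no Scrapy needed). The helper names `exact_match`, `substring_match` and `token_match` are made up for illustration, and `text-nowrap-lg` is a hypothetical class that does not appear in the question. They mimic, respectively, `@class="..."`, the `contains(@class, ...)` chain above, and the stricter XPath idiom `contains(concat(' ', normalize-space(@class), ' '), ' text-nowrap ')`.

```python
def exact_match(class_attr, wanted):
    """Mimics @class="wanted": the whole attribute string must be identical."""
    return class_attr == wanted


def substring_match(class_attr, names):
    """Mimics contains(@class, 'a') and contains(@class, 'b'): substring tests."""
    return all(name in class_attr for name in names)


def token_match(class_attr, names):
    """Mimics the concat/normalize-space idiom: every name must be a whole token."""
    tokens = class_attr.split()
    return all(name in tokens for name in names)


wanted = ["text-nowrap", "hidden-xs"]

# Same two classes in a different order: the exact comparison fails.
print(exact_match("hidden-xs text-nowrap", "text-nowrap hidden-xs"))
# Substring tests do not care about order ...
print(substring_match("hidden-xs text-nowrap", wanted))
# ... but they also accept a longer, different class name (hypothetical).
print(substring_match("text-nowrap-lg hidden-xs", wanted))
# Token matching rejects that false positive.
print(token_match("text-nowrap-lg hidden-xs", wanted))
```

So the `contains()` chain should be enough for the HTML in the question, but if the page also used class names that merely start with `text-nowrap`, the token-based idiom would be the safer choice.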
Matt

For these cases I prefer CSS selectors because of their minimal syntax:
response.css("p.text-nowrap.hidden-xs::text").get()

Also, Google Chrome's developer tools display CSS selectors when you inspect the HTML, which makes scraper development much easier.
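
As a rough illustration of what the compound selector `p.text-nowrap.hidden-xs` asks for (a `<p>` whose class list includes both names, in any order), here is a sketch using only Python's standard-library `html.parser`. The `ClassTextFinder` class is hypothetical, and the second `<p>` in the sample HTML is an assumption added to show a non-match. In Scrapy itself you would just use `response.css(...)`. Unlike `::text`, this sketch also collects text from nested child elements.

```python
from html.parser import HTMLParser


class ClassTextFinder(HTMLParser):
    """Collects the text of <tag> elements that carry all required classes."""

    def __init__(self, tag, required_classes):
        super().__init__()
        self.tag = tag
        self.required = set(required_classes)
        self.inside = False  # True while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        # The class attribute is a whitespace-separated list of class names.
        classes = set((dict(attrs).get("class") or "").split())
        if tag == self.tag and self.required <= classes:
            self.inside = True
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.results[-1] += data


# First <p> is from the question; the second one is an assumed non-match.
html = (
    '<div class="offer-item-details">'
    '<p class="text-nowrap hidden-xs">Apartamento para arrendar: Olivais, Lisboa</p>'
    '<p class="hidden-xs">other</p>'
    "</div>"
)

finder = ClassTextFinder("p", ["text-nowrap", "hidden-xs"])
finder.feed(html)
finder.close()
print(finder.results)
```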

Georgiy