
I'm trying to scrape some data from TripAdvisor. I'm interested in getting the "Price Range / Cuisine & Meals" of restaurants.

So I use the following XPath to extract each of these 3 lines, which all share the same class:

response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').extract()[1]

I'm testing it directly in the Scrapy shell and it works fine:

scrapy shell https://www.tripadvisor.com/Restaurant_Review-g187514-d15364769-Reviews-La_Gaditana_Castellana-Madrid.html

But when I integrate it into my script, I get the following error:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/Scrapy_TripAdvisor_Restaurant-master/tripadvisor_las_vegas/tripadvisor_las_vegas/spiders/res_las_vegas.py", line 64, in parse_listing
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
  File "/usr/lib/python3.6/site-packages/parsel/selector.py", line 61, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range

Here is the relevant part of my code, which I explain below:

# extract restaurant cuisine
    row_cuisine_overviewcard = \
    (response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')[1])
    row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
    
    
    if (row_cuisine_overviewcard == "CUISINES"):
        cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
    elif (row_cuisine_card == "CUISINES"):
        cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
    else:
        cuisine = None

TripAdvisor restaurant pages come in 2 different formats. The first uses an "overviewcard" class, and the second uses a "card" class.

So I want to check whether the first is present (overviewcard); if not, try the second (card); and if neither is present, set the value to None.

:D But it looks like Python executes both, and since the second one doesn't exist on the page, the script stops.

Could it be an indentation error?

Thanks for your help. Regards

2 Answers

Your second selector (row_cuisine_card) fails because the element does not exist on the page. When you then try to access [1] on the result, it raises an IndexError because the result list is empty.

Assuming you really want item 1, try this:

# Here we get all the values as a list of strings, even if it is empty.
row_cuisine_overviewcard = \
(response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall())
# Same here: the list is simply empty when the element does not exist.
row_cuisine_card = \
(response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall())


# Here we check first if that result has more than 1 item, and then we check the value.
if (len(row_cuisine_overviewcard) > 1 and row_cuisine_overviewcard[1] == "CUISINES"):
    cuisine = \
    response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
# The same two-step check for the second page format.
elif (len(row_cuisine_card) > 1 and row_cuisine_card[1] == "CUISINES"):
    cuisine = \
    response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
else:
    cuisine = None

You should apply the same kind of safety checking whenever you try to get a specific index from a selector. In other words, make sure you have a value before you access it.
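As a minimal sketch of that kind of safety check, the bounds test can be wrapped in a small helper. The name item_at and its default parameter are hypothetical, not part of Scrapy or parsel; the helper works on any plain list, such as the list of strings returned by .getall().

```python
def item_at(items, index, default=None):
    """Return items[index] if that position exists, otherwise `default`.

    Hypothetical helper (not a Scrapy/parsel API); only handles
    non-negative indexes, which is all the example needs.
    """
    if 0 <= index < len(items):
        return items[index]
    return default


# Stand-in for the strings a selector's .getall() might return.
titles = ["PRICE RANGE", "CUISINES", "MEALS"]

second_title = item_at(titles, 1)   # position exists
missing_title = item_at([], 1)      # empty result, falls back to None
```

In the spider this could be used as item_at(response.xpath(...).getall(), 1) == "CUISINES", which is simply False when the element is missing instead of raising.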

malberts

The problem is already in this line:

row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])

You are trying to extract a value from the website that may not exist. In other words, if

response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')

returns no element or only one element, then you cannot access the second element of the returned list (which is what the appended [1] does).
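To illustrate with plain Python lists standing in for the two XPath results (the list contents here are made up for the example, not taken from the real page): the lookup on the empty list fails on its own, regardless of any if/elif that comes later.

```python
# Stand-ins for the two XPath results on an "overviewcard" page.
# The values are illustrative assumptions, not real scraped data.
overviewcard_titles = ["PRICE RANGE", "CUISINES", "MEALS"]
card_titles = []  # the "card" markup is absent, so nothing matched


def describe_lookup(titles, index=1):
    """Return titles[index], or a short message if the lookup raises."""
    try:
        return titles[index]
    except IndexError as exc:
        return f"IndexError: {exc}"


first_result = describe_lookup(overviewcard_titles)
second_result = describe_lookup(card_titles)
print(first_result)
print(second_result)
```

So it is not an indentation problem: both assignments run before the if statement is ever reached, and the second one raises.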

I would recommend storing the values that you extract from the website into a local variable first in order to then check whether or not a value that you want has been found. My guess is that the page it breaks on does not have the information you want.

This could roughly look like the following code:

# extract restaurant cuisine
cuisine = None
row_cuisine_overviewcard = None
row_cuisine_card = None
# .getall() returns plain strings, so the comparisons with "CUISINES" below work
cuisine_overviewcard_sections = response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall()
if len(cuisine_overviewcard_sections) >= 2:
    row_cuisine_overviewcard = cuisine_overviewcard_sections[1]
cuisine_card_sections = response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall()
if len(cuisine_card_sections) >= 2:
    row_cuisine_card = cuisine_card_sections[1]
if (row_cuisine_overviewcard == "CUISINES"):
    cuisine = \
    response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
elif (row_cuisine_card == "CUISINES"):
    cuisine = \
    response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]

Since you only need the second check when the first XPath check does not already give the answer, the code can be simplified a bit:

# extract restaurant cuisine
cuisine = None
# .getall() returns plain strings, so the comparison with "CUISINES" works
cuisine_overviewcard_sections = response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall()
if len(cuisine_overviewcard_sections) >= 2 and cuisine_overviewcard_sections[1] == "CUISINES":
    cuisine = \
    response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
else:
    # only query the second page format when the first one did not match
    cuisine_card_sections = response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall()
    if len(cuisine_card_sections) >= 2 and cuisine_card_sections[1] == "CUISINES":
        cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]

This way you only do a (potentially expensive) XPath search when it actually is necessary.
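The same lazy idea can be sketched as a loop over the known page formats, which may be easier to extend if another layout appears. This is only a sketch under assumptions: extract_cuisine, LAYOUTS and the query parameter are hypothetical names, and query is assumed to be any callable that takes an XPath string and returns a list of strings (in a spider, something like lambda xp: response.xpath(xp).getall()).

```python
# (title XPath, value XPath) for each known page format, tried in order.
LAYOUTS = [
    (
        '//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()',
        '//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()',
    ),
    (
        '//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()',
        '//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()',
    ),
]


def extract_cuisine(query, layouts=LAYOUTS, index=1, label="CUISINES"):
    """Return the cuisine text from the first layout that matches, else None.

    `query` is an assumed callable: XPath string -> list of strings.
    The value XPath is only queried once the title check has passed.
    """
    for title_xpath, value_xpath in layouts:
        titles = query(title_xpath)
        if len(titles) > index and titles[index] == label:
            values = query(value_xpath)
            if len(values) > index:
                return values[index]
            # Title matched but the value is missing: try the next layout.
    return None


# Fake "card"-style page used only to exercise the function.
fake_page = {
    LAYOUTS[1][0]: ["PRICE RANGE", "CUISINES"],
    LAYOUTS[1][1]: ["$$", "Spanish"],
}
calls = []


def fake_query(xpath):
    calls.append(xpath)
    return fake_page.get(xpath, [])


cuisine = extract_cuisine(fake_query)
```

On the fake page above, the first layout costs one query (its title list is empty), and the second layout costs two, so no value XPath is evaluated for a layout that is not present.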

ingofreyer