I'm trying to scrape some data for airlines from the following website: http://www.airlinequality.com/airline-reviews/airasia-x.
I managed to get the data I need, but I am struggling with the pagination on the web page. I'm trying to get all the titles of the reviews (not only the ones on the first page).
The links of the pages have the format: http://www.airlinequality.com/airline-reviews/airasia-x/page/3/
where 3 is the page number.
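Since the page number is the only part of the URL that changes, one workaround is to build the page URLs directly rather than following links. This is just a sketch: `page_urls` is a hypothetical helper, and it assumes page 1 lives at the bare review URL (as it does for the link above), with pages 2..N under /page/N/:

```python
def page_urls(base_url, last_page):
    # page 1 is assumed to be the bare review URL; pages 2..N append /page/N/
    urls = [base_url]
    for n in range(2, last_page + 1):
        urls.append(f"{base_url}/page/{n}/")
    return urls

# e.g. the first three pages of AirAsia X reviews
urls = page_urls("http://www.airlinequality.com/airline-reviews/airasia-x", 3)
```

Each URL in the resulting list could then be fed to a spider's `start_urls` or requested in a loop.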
I tried to loop through these URLs, and I also tried the following piece of code, but scraping through the pagination is not working:
# follow pagination links
for href in response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > ul li a'):
    yield response.follow(href, self.parse)
How can I solve this?
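One thing worth knowing while debugging this: `response.follow` accepts relative hrefs and resolves them against the current page's URL, so a selector that matches nothing is the more likely culprit than the URL handling. A minimal sketch of that resolution using only the standard library (`absolutize` is a hypothetical helper name, not part of Scrapy):

```python
from urllib.parse import urljoin

def absolutize(page_url, hrefs):
    # resolve site-relative pagination hrefs against the current page URL,
    # mirroring what scrapy's response.follow does before issuing a request
    return [urljoin(page_url, h) for h in hrefs]

links = absolutize(
    "http://www.airlinequality.com/airline-reviews/airasia-x",
    ["/airline-reviews/airasia-x/page/2/", "/airline-reviews/airasia-x/page/3/"],
)
```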
import scrapy
import re  # for text parsing
import logging
from scrapy.crawler import CrawlerProcess


class AirlineSpider(scrapy.Spider):
    name = 'airlineSpider'
    # page to scrape
    start_urls = ['http://www.airlinequality.com/review-pages/a-z-airline-reviews/']

    def parse(self, response):
        # take each element in the list of airlines
        for airline in response.css("div.content ul.items li"):
            # go inside the URL for each airline
            airline_url = airline.css('a::attr(href)').extract_first()
            # call parse_article on the airline's review page
            next_page = airline_url
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse_article)

        # follow pagination links
        for href in response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > ul li a'):
            yield response.follow(href, self.parse)

    # parse the pages inside the links (for each airline) - the pages where the reviews are
    def parse_article(self, response):
        yield {
            'appears_ulr': response.url,
            # use sub to replace \n \t \r in the result
            'title': re.sub(r'\s+', ' ', response.css('div.info [itemprop="name"]::text').extract_first().strip(' \t\r\n').replace('\n', ' ')).strip(),
            'reviewTitle': response.css('div.body .text_header::text').extract(),
            #'total': response.css('#main > section.layout-section.layout-2.closer-top > div.col-content > div > article > div.pagination-total::text').extract_first().split(" ")[4],
        }


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'air_test.json'
})

# minimize the information presented in the scrapy log
logging.getLogger('scrapy').setLevel(logging.WARNING)

process.crawl(AirlineSpider)
process.start()
To iterate through the airlines, I solved it with the following code instead of the first loop in parse above:
import re
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request("http://www.airlinequality.com/review-pages/a-z-airline-reviews/", headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req)
soupAirlines = BeautifulSoup(html_page, "lxml")

URL_LIST = []
for link in soupAirlines.findAll('a', attrs={'href': re.compile("^/airline-reviews/")}):
    URL_LIST.append("http://www.airlinequality.com" + link.get('href'))
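Once the first review page of an airline is fetched, the number of pages could be read off the pagination links themselves, since they all contain /page/N/ as noted above. A sketch with a hypothetical helper `last_page_number`, using a hard-coded HTML fragment in place of a real airlinequality.com page:

```python
import re

def last_page_number(html):
    # pull every /page/N/ occurrence out of the pagination links and take
    # the largest; fall back to 1 when no pagination markup is present
    pages = [int(n) for n in re.findall(r'/page/(\d+)/', html)]
    return max(pages) if pages else 1

# stand-in for a real review page's pagination block
sample = ('<ul><li><a href="/airline-reviews/airasia-x/page/2/">2</a></li>'
          '<li><a href="/airline-reviews/airasia-x/page/3/">&gt;&gt;</a></li></ul>')
```

With the page count in hand, the /page/N/ URLs can be generated and scraped one by one, which sidesteps the broken pagination-link selector entirely.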