
I am trying to use BeautifulSoup and Selenium to scrape data from Airbnb. I want to gather each listing from this search page.

This is what I have so far:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def scrape_page(page_url):
    
    driver_path = "C:/Users/parkj/Downloads/chromedriver_win32/chromedriver.exe"
    driver = webdriver.Chrome(service = Service(driver_path))
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'itemprop')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    
    return soup

def extract_listing(page_url):
    
    page_soup = scrape_page(page_url)
    listings = page_soup.find_element(By.CLASS_NAME, "itemprop")
    return listings

page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown"
#items = extract_listing(page_url)

#process items to get all information you need, just an example
#[{'name':items.select_one('[itemprop="name"]')['content'],
#  'url':items.select_one('[itemprop="url"]')['content']} 
# for i in items]

test = scrape_page(page_url)
test

It seems that scrape_page() returns the HTML of the search page, but not its full content. The result is missing the information I need, which is this part of the HTML:

Image of HTML Script

I did some research and saw that WebDriverWait might help, but when I use it I get a TimeoutException.

TimeoutException Error

The end goal is to get each listing's name and URL. The first 3 items in the resulting list should look similar to this:

[{'name': '✿Kyoto✿/Near Station & Bus/Temple/Twin Room(^^♪✿✿',
  'url': 'www.airbnb.com/rooms/50290730?adults=1&children=0&infants=0&check_in=2022-07-20&check_out=2022-07-27&previous_page_section_name=1000'},
 {'name': 'Stay in Kyoto central island',
  'url': 'www.airbnb.com/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'},
 {'name': '和楽庵【Single】100 Year old Machiya Guest House (1pax)',
  'url': 'www.airbnb.com/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'}]

I apologize in advance if I did not include enough information in this question, as this is my first time posting here. I would appreciate any help, thank you.

2 Answers

I don't use Selenium very often, but I recommend the requests library for this.

Try this:

from requests import get
from bs4 import BeautifulSoup

headers = {'User-agent':'Mozilla/5.0 (X11; Linux i686; rv:100.0) Gecko/20100101 Firefox/100.0.'}

res = get('https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown', headers=headers)

soup = BeautifulSoup(res.text, features="html.parser")

url_list = soup.find_all("meta", attrs={"itemprop":"url"})

In my case, it returned 20 results, which is as many as can be displayed on one page. If you want more results, you need to scrape the following pages as well.
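
url_list holds the meta tags themselves, not the URL strings. The following is a minimal sketch of how the `content` attributes could be read and paired with the listing names. The `sample_html` string is a hypothetical stand-in for `res.text` (the real Airbnb markup may differ), and `meta_contents` is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for res.text; the real Airbnb markup may differ.
sample_html = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="Stay in Kyoto central island">
  <meta itemprop="url" content="www.airbnb.com/rooms/42780789?adults=1">
</div>
<div itemprop="itemListElement">
  <meta itemprop="name" content="Hotel Sou Kyoto Gion Queen Studio">
  <meta itemprop="url" content="www.airbnb.com/rooms/40236377?adults=1">
</div>
"""


def meta_contents(html, itemprop):
    """Return the content attribute of every <meta itemprop=...> tag."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("meta", attrs={"itemprop": itemprop})
    # .get() returns None instead of raising if a tag has no content attribute.
    return [tag.get("content") for tag in tags]


urls = meta_contents(sample_html, "url")
names = meta_contents(sample_html, "name")

# Assumption: names and URLs appear in the same order and in equal numbers.
listings = [{"name": name, "url": url} for name, url in zip(names, urls)]
print(listings)
```

Pairing with zip only works if every listing has both tags. If that is not guaranteed, it is probably safer to select each listing container first and read the name and URL inside it.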

Sending the Firefox user agent is very important. Many websites block requests that arrive with the default user agent of the requests library, but do not block requests that identify themselves as a browser like this.

djmonki
Moonar

Select the elements you are waiting for more specifically. itemprop is an attribute, not an id, so in this case use a CSS selector:

wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))
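
To see why the selector matters, here is a small offline sketch using BeautifulSoup only. By.ID looks for an element whose id attribute equals the given value, which corresponds to the CSS selector `#itemprop`. The `sample_html` below is a hypothetical simplification of the listing markup, not the actual Airbnb page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page is far larger.
sample_html = """
<div itemprop="itemListElement"><meta itemprop="name" content="A"></div>
<div itemprop="itemListElement"><meta itemprop="name" content="B"></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Id selector: matches only elements with id="itemprop", and there are none.
by_id = soup.select("#itemprop")

# Attribute selector: matches elements whose itemprop attribute has this value.
by_attr = soup.select('[itemprop="itemListElement"]')

print(len(by_id), len(by_attr))  # prints: 0 2
```

If the live page likewise has no element with id="itemprop", the original wait condition can never become true, which would be consistent with the TimeoutException.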

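Also avoid mixing Selenium syntax into BeautifulSoup. find_element(By.CLASS_NAME, ...) is a Selenium method and does not work on a BeautifulSoup object, so use BeautifulSoup's own select() with a CSS selector instead: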

listings = page_soup.select('[itemprop="itemListElement"]')

Example

...
def scrape_page(page_url):
    driver.get(page_url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[itemprop="itemListElement"]')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.close()
    
    return soup

def extract_listing(page_url):
    
    page_soup = scrape_page(page_url)
    listings = page_soup.select('[itemprop="itemListElement"]')
    return listings

page_url = "https://www.airbnb.com/s/Kyoto-Prefecture--Japan/homes?tab_id=home_tab&flexible_trip_lengths%5B%5D=one_week&refinement_paths%5B%5D=%2Fhomes&place_id=ChIJYRsf-SB0_18ROJWxOMJ7Clk&query=Kyoto%20Prefecture%2C%20Japan&date_picker_type=flexible_dates&search_type=unknown"
items = extract_listing(page_url)

#process items to get all information you need, just an example
[{'name':i.select_one('[itemprop="name"]')['content'],
 'url':i.select_one('[itemprop="url"]')['content']} 
for i in items]

Output

[{'name': '✿Kyoto✿/Nähe Bahnhof & Bus/Tempel/Einzelzimmer(^^♪',
  'url': 'www.airbnb.de/rooms/50293998?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
 {'name': '100 Jahre altes Machiya-Gästehaus (1Pax)',
  'url': 'www.airbnb.de/rooms/48645312?adults=1&children=0&infants=0&check_in=2022-08-22&check_out=2022-08-29&previous_page_section_name=1000'},
 {'name': '27, Deluxe Designer Zweibett- / Dreibettzimmer in Shijo (1-3 Personen  / Nichtraucher)',
  'url': 'www.airbnb.de/rooms/41413491?adults=1&children=0&infants=0&check_in=2023-05-16&check_out=2023-05-23&previous_page_section_name=1000'},
 {'name': 'Aufenthalt auf der zentralen Insel Kyoto',
  'url': 'www.airbnb.de/rooms/42780789?adults=1&children=0&infants=0&check_in=2022-06-24&check_out=2022-07-01&previous_page_section_name=1000'},
 {'name': 'Sweet 202 Privatzimmer ☘️',
  'url': 'www.airbnb.de/rooms/30217767?adults=1&children=0&infants=0&check_in=2022-07-18&check_out=2022-07-25&previous_page_section_name=1000'},
 {'name': 'Kyoto Sanjo Ohashi Superior Zweibettzimmer Studio Nichtraucher Superior Zweibettzimmer',
  'url': 'www.airbnb.de/rooms/45207535?adults=1&children=0&infants=0&check_in=2022-09-27&check_out=2022-10-04&previous_page_section_name=1000'},
 {'name': 'Toller Blick auf den Fluss, schönes traditionelles Haus',
  'url': 'www.airbnb.de/rooms/25762078?adults=1&children=0&infants=0&check_in=2022-12-07&check_out=2022-12-14&previous_page_section_name=1000'},
 {'name': 'Doppelzimmer - Waschmaschine in allen Zimmern ☆ Guest House 10-Minuten zu Fuß von Kyoto Station -',
  'url': 'www.airbnb.de/rooms/51433076?adults=1&children=0&infants=0&check_in=2022-06-13&check_out=2022-06-20&previous_page_section_name=1000'},
 {'name': 'In der Nähe des Bahnhofs Kyoto Gemütliches Zimmer in einem traditionellen Haus',
  'url': 'www.airbnb.de/rooms/25600163?adults=1&children=0&infants=0&check_in=2022-09-12&check_out=2022-09-19&previous_page_section_name=1000'},
 {'name': 'Gemütliche und ruhige zweistöckige japanische Wohnung',
  'url': 'www.airbnb.de/rooms/38743436?adults=1&children=0&infants=0&check_in=2023-03-11&check_out=2023-03-18&previous_page_section_name=1000'},
 {'name': '51★Günstigste★5 Minuten zu Fuß Shin-Osaka Sta.★Max 1 Gäste',
  'url': 'www.airbnb.de/rooms/14539052?adults=1&children=0&infants=0&check_in=2022-07-03&check_out=2022-07-10&previous_page_section_name=1000'},
 {'name': '和楽庵【Doppel】100 Jahre altes Machiya Gästehaus (2pax)',
  'url': 'www.airbnb.de/rooms/22867502?adults=1&children=0&infants=0&check_in=2022-08-26&check_out=2022-09-02&previous_page_section_name=1000'},
 {'name': 'Expo Hostel Nishi #1 /1000yen Fahrrad für deinen Aufenthalt',
  'url': 'www.airbnb.de/rooms/8295322?adults=1&children=0&infants=0&check_in=2022-08-27&check_out=2022-09-03&previous_page_section_name=1000'},
 {'name': '★Lovely RiverSide House in★der Nähe von Einkaufsviertel★3 Betten',
  'url': 'www.airbnb.de/rooms/40117962?adults=1&children=0&infants=0&check_in=2022-07-07&check_out=2022-07-14&previous_page_section_name=1000'},
 {'name': 'ZIMMER - Bereich Central Kyoto Gion',
  'url': 'www.airbnb.de/rooms/15215980?adults=1&children=0&infants=0&check_in=2022-06-14&check_out=2022-06-21&previous_page_section_name=1000'},
 {'name': 'Raum, um das Kyoto zu genießen.',
  'url': 'www.airbnb.de/rooms/9263813?adults=1&children=0&infants=0&check_in=2022-09-08&check_out=2022-09-15&previous_page_section_name=1000'},
 {'name': 'Stilvolles modernes Kyo-Machiya 500 金閣寺 m vom Trockner entfernt',
  'url': 'www.airbnb.de/rooms/20041502?adults=1&children=0&infants=0&check_in=2022-07-27&check_out=2022-08-03&previous_page_section_name=1000'},
 {'name': 'Hotel Sou Kyoto Gion Queen Studio',
  'url': 'www.airbnb.de/rooms/40236377?adults=1&children=0&infants=0&check_in=2022-06-22&check_out=2022-06-29&previous_page_section_name=1000'},
 {'name': 'Workation GroLiving in  KYOTO',
  'url': 'www.airbnb.de/rooms/612511811801466646?adults=1&children=0&infants=0&check_in=2022-07-19&check_out=2022-07-26&previous_page_section_name=1000'},
 {'name': '【home quarantin ok】shibainuatiniya/Kyoto Sta/Toji',
  'url': 'www.airbnb.de/rooms/34028813?adults=1&children=0&infants=0&check_in=2022-06-21&check_out=2022-06-28&previous_page_section_name=1000'}]
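
The URLs in the output have no scheme, so they cannot be passed directly to driver.get() or requests. A minimal sketch of one way to clean them up with the standard library follows. `normalize_listing_url` is a helper name made up for this example, and it assumes the path always has the form /rooms/<id> as in the output above.

```python
from urllib.parse import urlsplit


def normalize_listing_url(raw_url):
    """Prefix a scheme if it is missing and pull the room id out of the path."""
    full_url = raw_url if "://" in raw_url else "https://" + raw_url
    parts = urlsplit(full_url)
    # Assumption: the path looks like /rooms/<id>, as in the output above.
    segments = [segment for segment in parts.path.split("/") if segment]
    is_room = len(segments) > 1 and segments[0] == "rooms"
    room_id = segments[1] if is_room else None
    return {"url": full_url, "room_id": room_id}


example = normalize_listing_url(
    "www.airbnb.de/rooms/48645312?adults=1&children=0&infants=0"
    "&check_in=2022-08-22&check_out=2022-08-29&previous_page_section_name=1000"
)
print(example["room_id"])  # prints: 48645312
```

The room id can also serve as a key for removing duplicates when the same listing shows up on several result pages.
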
HedgeHog