I'm experimenting with scraping a website for data.
This is what I've put together after a few days of research. However, the output from Nokogiri is not as "clean" as I would expect: when I print my array, I get a lot of line-break characters ("\n") in the output.
require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'
# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')
# Parse the HTTP response into a Nokogiri document
parse_page = Nokogiri::HTML(page)
# Create an empty array for property details
details_array = []
# Collect the raw text of every listing block
parse_page.css('div.srp-item-body').each do |d|
  property_details = d.text
  details_array.push(property_details)
end
Pry.start(binding)
While in Pry, if I display details_array or address_array, the output looks like:
[2] pry(main)> details_array
=> ["\n \n \n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n \n\n \n Dresden East\n \n \n\n $289,900\n \n \n \n 3 bd\n 2 ba\n 1,566 sq ft\n
0.3 acres lot\n \n \n \n \n Single Family Home\n \n \n \n \n
Brokered by Re/Max Town And Country\n \n \n
\n \n \n Brokered by \n Re/Max
Town And Country\n \n \n \n ", "\n \n
\n \n 2141 Dunwoody Gln,\n
Atlanta,\n GA\n 30338\n \n \n\n
\n \n $469,900\n \n \n
\n 4 bd\n 3 ba\n 2,850 sq
ft\n 0.3 acres lot\n 2 car\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Buckhead Home Realty Llc\n \n \n \n
\n \n Brokered by \n Buckhead Home
Realty Llc\n \n \n \n ", "\n \n
\n \n 1048 Martin St SE,\n
Atlanta,\n GA\n 30315\n \n \n\n
\n Intown South\n Peoplestown\n \n \n
\n $164,900\n \n \n \n
5 bd\n 3 ba\n 2,376 sq ft\n
7,405 sq ft lot\n \n \n \n \n
Single Family Home\n \n \n \n \n
Brokered by Greenlet Llc\n \n \n \n
\n \n Brokered by \n Greenlet Llc\n
\n \n \n ", "\n \n \n \n
1048 Martin St SE,\n Atlanta,\n GA\n
30315\n \n \n\n \n Intown South\n
Peoplestown\n \n \n \n $164,900\n
\n \n \n 5 bd\n 3
ba\n 2,055 sq ft\n 7,584 sq ft lot\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Greenlet, Llc\n \n \n \n \n
\n Brokered by \n Greenlet, Llc\n \n
\n \n ", "\n \n \n \n
1991 Woodbine Ter NE,\n Atlanta,\n GA\n
30329\n \n \n\n \n Sagamore Hills\n
\n \n \n $299,900\n \n \n
\n 3 bd\n 1+ ba\n 1,449
sq ft\n 0.8 acres lot\n \n \n
\n \n Single Family Home\n \n \n
\n :
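
The "\n" runs above are most likely just the whitespace between nested HTML elements, which the text method returns unchanged. A minimal sketch of one way to collapse it, using only plain Ruby on a shortened sample of the output above (squish_text is a hypothetical helper name, not part of Nokogiri or HTTParty):

```ruby
# Hypothetical helper (not a Nokogiri API): collapse every run of
# whitespace, including "\n", into a single space and trim both ends.
def squish_text(raw)
  # String#split with no arguments splits on any whitespace run
  # and discards leading and trailing whitespace.
  raw.split.join(' ')
end

# Shortened sample based on the first entry of details_array above.
sample = "\n \n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n $289,900\n \n 3 bd\n 2 ba\n "

cleaned = squish_text(sample)
puts cleaned
```

In the scraper itself this would presumably mean pushing squish_text(d.text) instead of d.text. Note that this only tidies the whitespace; the fields still end up in one string, so selecting the individual child elements would be needed to separate address, price, and so on.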