I'm experimenting with how to scrape a website for data.

This is what I've put together after a few days of research. However, the output from Nokogiri is not as "clean" as I would expect: when I print my array, the output contains a lot of "\n" line breaks.

require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'

# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')

# Transform the http response into a Nokogiri in order to parse it
parse_page = Nokogiri::HTML(page)

# Create an empty array for property details
details_array = []
parse_page.css('div.srp-item-body').map do |d|
    property_details = d.text
    details_array.push(property_details)
end

Pry.start(binding)

While in Pry, if I display details_array or address_array, output looks like:

[2] pry(main)> details_array
=> ["\n      \n        \n          \n                2265 Tanglewood Cir NE,\n            Atlanta,\n            GA\n            30345\n \n        \n\n        \n          Dresden East\n        \n        \n\n            $289,900\n          \n          \n            \n        3 bd\n                2 ba\n                1,566 sq ft\n             
0.3 acres lot\n            \n          \n        \n          \n            Single Family Home\n          \n        \n          \n            \n  
Brokered by Re/Max Town And Country\n            \n          \n       
\n        \n          \n            Brokered by \n            Re/Max
Town And Country\n          \n        \n      \n    ",  "\n      \n   
\n          \n                2141 Dunwoody Gln,\n           
Atlanta,\n            GA\n            30338\n          \n        \n\n 
\n          \n            $469,900\n          \n          \n          
\n                4 bd\n                3 ba\n                2,850 sq
ft\n                0.3 acres lot\n                2 car\n           
\n          \n        \n          \n            Single Family Home\n  
\n        \n          \n            \n              Brokered by
Buckhead Home Realty Llc\n            \n          \n        \n       
\n          \n            Brokered by \n            Buckhead Home
Realty Llc\n          \n        \n      \n    ",  "\n      \n       
\n          \n                1048 Martin St SE,\n           
Atlanta,\n            GA\n            30315\n          \n        \n\n 
\n          Intown South\n          Peoplestown\n        \n        \n 
\n            $164,900\n          \n          \n            \n        
5 bd\n                3 ba\n                2,376 sq ft\n             
7,405 sq ft lot\n            \n          \n        \n          \n     
Single Family Home\n          \n        \n          \n            \n  
Brokered by Greenlet Llc\n            \n          \n        \n       
\n          \n            Brokered by \n            Greenlet Llc\n    
\n        \n      \n    ",  "\n      \n        \n          \n         
1048 Martin St SE,\n            Atlanta,\n            GA\n           
30315\n          \n        \n\n        \n          Intown South\n     
Peoplestown\n        \n        \n          \n            $164,900\n   
\n          \n            \n                5 bd\n                3
ba\n                2,055 sq ft\n                7,584 sq ft lot\n    
\n          \n        \n          \n            Single Family Home\n  
\n        \n          \n            \n              Brokered by
Greenlet, Llc\n            \n          \n        \n        \n         
\n            Brokered by \n            Greenlet, Llc\n          \n   
\n      \n    ",  "\n      \n        \n          \n               
1991 Woodbine Ter NE,\n            Atlanta,\n            GA\n         
30329\n          \n        \n\n        \n          Sagamore Hills\n   
\n        \n          \n            $299,900\n          \n          \n
\n                3 bd\n                1+ ba\n                1,449
sq ft\n                0.8 acres lot\n            \n          \n      
\n          \n            Single Family Home\n          \n        \n  
\n           :
the Tin Man
pjw23
  • What do you expect to have where `"\n"` occurred in the page source? _Sidenote:_ `details_array = parse_page.css('div.srp-item-body').map(&:text)` would fill the `details_array` for you in a more Ruby-ish manner. – Aleksei Matiushkin Nov 22 '16 at 15:10
  • Please read "[mcve]". When asking about parsing, it's especially important that you supply the absolute minimum HTML that demonstrates the problem. Without that we have to generate it which wastes our time when trying to help you. – the Tin Man Nov 22 '16 at 20:08

1 Answer

It looks like you're not digging into the document far enough with your selector. Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div>
      <p>foo</p>
      <p>bar</p>
    </div>
  </body>
</html>
EOT

doc.search('div').map(&:text) # => ["\n      foo\n      bar\n    "]

When you take the text of a parent tag, you get the whitespace text nodes used to format the HTML as well as the text of the desired <p> nodes.

If you drill down to the actual nodes you want and then get their text, you avoid picking up the inter-tag formatting:

doc.search('div p').map(&:text) # => ["foo", "bar"]

See "How to avoid joining all text from Nodes when scraping" also.

the Tin Man
  • @Tin Man, thank you so much for the info. I was able to clean up a lot of the data by digging a little deeper; however, there are still a few that I can't figure out. I'm trying to pull the price where the tag looks like: $100,000. I'm using the following code, but I can't seem to get just the text; I still get "\n". My code: `parse_property_page.css('li.srp-item-price.srp-items-floated').map(&:text)`. The output looks like: `\n $100,000 \n`. Any ideas? – pjw23 Nov 23 '16 at 04:07
  • That isn't enough to work from. You need to add the _minimum_ HTML input data to your question, formatted appropriately, that demonstrates what you're trying to do. Obviously your HTML is more complex than the little bit in the `<li>` tag you mentioned in the comment. Think of it this way: you want our help, so in exchange we expect you to do the prep work in your question to help us help you. – the Tin Man Nov 23 '16 at 18:17
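
For padding that survives even on the innermost node, as with the `\n $100,000 \n` output described in the comments, plain Ruby string methods may be enough. This is a minimal sketch that assumes the whitespace sits in the node's own text; the sample string is a shortened, made-up version of the output in the question, and the variable names are only illustrative.

```ruby
# Sample string shaped like the output in the question (shortened).
raw = "\n      \n        2265 Tanglewood Cir NE,\n            Atlanta,\n" \
      "            GA\n            30345\n \n        $289,900\n    "

# Option 1: collapse every run of whitespace to a single space,
# then trim the ends, giving one tidy line.
one_line = raw.gsub(/\s+/, ' ').strip

# Option 2: keep each non-blank line as its own field.
fields = raw.lines.map(&:strip).reject(&:empty?)

puts one_line
p fields
```

Applied to Nokogiri results, the same idea would look like `nodes.map { |n| n.text.strip }` in place of `nodes.map(&:text)`.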