Usually when scraping websites, I use "SelectorGadget". If not, I would have to inspect some elements on a page.

However, I am running into a bit of trouble when trying to scrape this one website.

The HTML looks like this:

<div class="col-span-2 mt-16 sm:mt-4 flex justify-between sm:block space-x-12 font-bold"><span>103 m²</span><span>8&nbsp;650&nbsp;000&nbsp;kr</span></div>

Elements that I want:

<span>103 m²</span>
<span>8&nbsp;650&nbsp;000&nbsp;kr</span>

They look like this: 103 m² 8 650 000 kr

My simple R code:

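library(rvest)

# The URL (%d is filled in with the page number)
url = "https://www.finn.no/realestate/homes/search.html?page=%d&sort=PUBLISHED_DESC"

page_outside <- read_html(sprintf(url,1))

# "x" is a placeholder for the selector I cannot figure out
element_1 <- page_outside %>% html_nodes("x") %>% html_text()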

Anyone got any tips or ideas on how I can access these?

thanks!

Chrisabe

1 Answer

Here is one possibility: select the span nodes under a div with the class "justify-between".

library(rvest)

url = "https://www.finn.no/realestate/homes/search.html?page=%d&sort=PUBLISHED_DESC"
page_outside <- read_html(sprintf(url,1))

element_1 <- page_outside %>% html_elements("div.justify-between span")
element_1

{xml_nodeset (100)}
 [1] <span>47 m²</span>
 [2] <span>3 250 000 kr</span>
 [3] <span>102 m²</span>
 [4] <span>2 400 000 kr</span>
 [5] <span>100 m²</span>
 [6] <span>10 000 000 kr</span>
 [7] <span>122 m²</span>
 [8] <span>9 950 000 kr</span>
 [9] <span>90 m²</span>
[10] <span>4 790 000 kr</span>
...
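
To see why this selector matches, here is a minimal offline sketch that applies it to the fragment quoted in the question instead of the live page. It assumes rvest 1.0 or later (for minimal_html() and html_text2()); the names snippet, spans and texts are only illustrative.

```r
library(rvest)

# The fragment from the question, wrapped into a minimal HTML document.
# "\u00b2" is the superscript two in "m²", written as an escape to avoid encoding issues.
snippet <- minimal_html(
  '<div class="col-span-2 mt-16 sm:mt-4 flex justify-between sm:block space-x-12 font-bold"><span>103 m\u00b2</span><span>8&nbsp;650&nbsp;000&nbsp;kr</span></div>'
)

# "div.justify-between span" selects every <span> nested inside a <div>
# whose class list contains "justify-between"
spans <- snippet %>% html_elements("div.justify-between span")

# html_text2() returns the visible text of each node
texts <- spans %>% html_text2()
texts
```

Only one of the classes on the div has to appear in the selector, so the long class list does not need to be matched in full.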

Update
If some data is missing, a slightly longer solution is needed to track which element is missing.

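# Process each listing's div separately so area and price stay paired
divs <- page_outside %>% html_elements("div.justify-between")
answer <- lapply(divs, function(node) {
   values <- node %>% html_elements("span") %>% html_text()
   if (length(values) == 2) {
      results <- values            # both area and price present
   } else if (length(values) == 1 && grepl("kr", values)) {
      results <- c(NA, values)     # only the price is present
   } else if (length(values) == 1) {
      results <- c(values, NA)     # only the area is present
   } else {
      results <- c(NA, NA)         # unexpected number of spans
   }
   results
})
# Stack the per-listing pairs into a two-column matrix
answer <- do.call(rbind, answer)
answer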

       [,1]           [,2]                       
 [1,] "87 m²"        "2 790 000 kr"             
 [2,] "124 m²"       "5 450 000 kr"             
 [3,] "105 m²"       "4 500 000 kr"             
 [4,] "134 m²"       "1 500 000 kr" 
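
If numeric values are needed, the character matrix can be converted afterwards. The following is only a sketch: it rebuilds a small matrix from rows like those in the output above, adds one hypothetical row with a missing area, and uses a helper named to_number() and column names that are my own assumptions.

```r
# Rows like those in the output above, plus one hypothetical row with a missing area
answer <- rbind(
  c("87 m\u00b2",  "2 790 000 kr"),
  c("124 m\u00b2", "5 450 000 kr"),
  c(NA,            "4 500 000 kr")
)

# Hypothetical helper: drop everything that is not an ASCII digit
# (spaces, non-breaking spaces, units), then convert to numeric.
# NA inputs stay NA.
to_number <- function(x) as.numeric(gsub("[^0-9]", "", x))

listings <- data.frame(
  area_m2  = to_number(answer[, 1]),
  price_kr = to_number(answer[, 2])
)
listings
```

Stripping non-digits works here because neither column contains decimals; values with a decimal separator would need a different pattern.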
Dave2e