1
package main

import (
"encoding/csv"
"fmt"
"os"

"github.com/gocolly/colly"
)

 func checkError(err error){
 if err!=nil{
    panic(err)
}
}
// main scrapes the Forbes real-time billionaires page and writes the
// extracted rank cell of each table row to data.csv.
//
// NOTE(review): the table on this page appears to be populated
// client-side by JavaScript, so a static scraper like Colly may only
// receive the fallback page. Sending a browser-like User-Agent helps
// avoid being served the fallback, but fully JS-rendered content needs
// a headless browser (e.g. chromedp) or the site's underlying JSON API.
func main() {
	const fName = "data.csv"

	file, err := os.Create(fName)
	checkError(err)
	defer file.Close()

	writer := csv.NewWriter(file)
	defer writer.Flush()

	c := colly.NewCollector(
		colly.AllowedDomains("forbes.com", "www.forbes.com"),
		// Some servers block clients with "nonstandard" user agents;
		// present a common browser User-Agent instead of colly's default.
		colly.UserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0 Safari/537.36"),
	)

	c.OnHTML(".scrolly-table tbody tr", func(e *colly.HTMLElement) {
		// One CSV row per table row; surface write errors instead of
		// silently dropping them.
		if err := writer.Write([]string{
			e.ChildText(".rank .ng-binding"),
		}); err != nil {
			fmt.Println("CSV write failed:", err)
		}
	})
	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", string(r.Body))
	})

	// Visit returns an error (disallowed domain, network failure, ...);
	// the original discarded it, which hides the reason scraping fails.
	if err := c.Visit("https://forbes.com/real-time-billionaires/"); err != nil {
		fmt.Println("Visit failed:", err)
	}
}

This is my code. When I make the request, I am getting the fallback page. This is the link for the Forbes page that I am trying to scrape.

I have noticed that the website uses a hash path at the end of the URL, and I cannot request the same URL twice. I think this is somehow related to scraping — can anyone help me with this?

  • 1
    using curl on `https://www.forbes.com/real-time-billionaires/` gave me a 301 response, but using the url with the hash `...#2e4fa2853d78` gave me some content that looked like the top text and a bunch of javascript that presumably populates the table. I could use the same url with curl multiple times. Also, a hint from my experience in scraping is that sometimes a server will block clients that have a "nonstandard" user agent, and I just supply the user agent from my browser with my request. I would explore the JS and see if you can scrape their API directly. – Benny Jobigan Nov 01 '21 at 13:53

2 Answers

3

Check what is available if you disable JavaScript in your browser (you can do this using the developer tools). Most scrapers will only get you the textual representation of the page, while the browser will also run a JavaScript engine against it. If the data you are trying to scrape is populated by JavaScript, there is a very good chance that this is the reason you can't scrape it.

jabbson
  • 4,390
  • 1
  • 13
  • 23
0

Colly can only be used for static scraping, chromedp can be used for scraping client side rendered applications.