Don't scrape GitHub. Scraping is fragile, and very awkward with sites that make heavy use of JavaScript.
Use its API instead:
https://api.github.com/search/code?q=https%3A%2F%2Fsonarcloud.io%2Fdashboard
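The percent-encoded query string in that URL does not have to be written by hand. A minimal sketch using only Ruby's standard library; search_url is a hypothetical helper name, not part of any GitHub client:

```ruby
require "uri"

# Hypothetical helper: builds a code-search URL for the given search term.
def search_url(term)
  uri = URI("https://api.github.com/search/code")
  # encode_www_form percent-encodes the reserved characters (":" and "/") in the term.
  uri.query = URI.encode_www_form(q: term)
  uri.to_s
end

puts search_url("https://sonarcloud.io/dashboard")
# prints https://api.github.com/search/code?q=https%3A%2F%2Fsonarcloud.io%2Fdashboard
```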
To search across all repositories you'll still need to authenticate, though. You need to pass your auth hash into HTTParty.get():
auth = {:username => "username", :password => "password"}
a = HTTParty.get(url, :basic_auth => auth)
#                     ^ Here
More idiomatically, this might look like:
auth = {username: "username", password: "password"}
a = HTTParty.get(url, basic_auth: auth)
You also have a typo (usarname instead of username), which I've fixed in my version.
Edit: If you want to retrieve the specific matched text, file, and lines, you still don't have to scrape their HTML. Instead, you can set your Accept header to application/vnd.github.v3.text-match+json:
url = "https://api.github.com/search/code"
query = {q: "https://sonarcloud.io/dashboard"}
auth = {username: "username", password: "password"}
headers = {"Accept" => "application/vnd.github.v3.text-match+json"}
a = HTTParty.get(url, query: query, basic_auth: auth, headers: headers)
Each item in the response should now have a text_matches key containing hashes with fragments showing the matched text, as well as object_types (e.g. "FileContent"), object_urls, and indices.
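To show how those fields might be read, here is a small sketch that pulls the fragments out of a parsed response (for instance a.parsed_response from the call above). The helper name matched_fragments and the sample hash are made up for illustration; the sample only imitates the shape of a real response.

```ruby
# Hypothetical helper: collects [path, fragment] pairs from a parsed
# code-search response hash.
def matched_fragments(parsed)
  parsed.fetch("items", []).flat_map do |item|
    # Items without text_matches (no text-match media type) contribute nothing.
    item.fetch("text_matches", []).map do |match|
      [item["path"], match["fragment"]]
    end
  end
end

# Made-up sample data shaped like a search response, for illustration only.
sample = {
  "items" => [
    {
      "path" => "README.md",
      "text_matches" => [
        {
          "object_type" => "FileContent",
          "fragment" => "See https://sonarcloud.io/dashboard?id=example",
          "matches" => [{ "text" => "sonarcloud.io", "indices" => [12, 25] }]
        }
      ]
    }
  ]
}

matched_fragments(sample).each { |path, fragment| puts "#{path}: #{fragment}" }
```

Using fetch with a default keeps the helper from raising when a key is missing, which matters if the request was made without the text-match Accept header.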
This is also mentioned in the search code link I already provided:
When searching for code, you can get text match metadata for the file content and file path fields when you pass the text-match media type. For more details about how to receive highlighted search results, see Text match metadata.