
I'm trying to scrape this page:

https://github.com/search?p=1&q=https%3A%2F%2Fsonarcloud.io%2Fdashboard&type=Code

and I need to authenticate with my email and password.

I have tried this:

auth = {:usarname => "username", :password => "password"}

a = HTTParty.get(url, :basic_auth)

but this didn't authenticate me as expected.

Why isn't this working, and how can I fix it?

I want to retrieve that information, and it isn't available through the GitHub API.


1 Answer


Don't scrape GitHub. Scraping is fragile and very awkward with sites that make heavy use of JavaScript.

Use its API instead:

https://api.github.com/search/code?q=https%3A%2F%2Fsonarcloud.io%2Fdashboard

To search across all repositories you'll still need to authenticate, though. You need to pass your auth hash into HTTParty.get():

auth = {:username => "username", :password => "password"}

a = HTTParty.get(url, :basic_auth => auth)
#                                 ^  Here

More idiomatically, this might look like:

auth = {username: "username", password: "password"}

a = HTTParty.get(url, basic_auth: auth)
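Once the request succeeds, HTTParty parses the JSON body for you. Here's a sketch of what working with the result looks like; to keep it self-contained I've substituted a trimmed, made-up sample body in place of a live request, shaped like the search API's documented response (total_count plus an items array):

```ruby
require "json"

# Trimmed sample of what the code search endpoint returns.
# Shape follows the GitHub v3 search docs; the values are invented.
body = <<~JSON
  {
    "total_count": 2,
    "incomplete_results": false,
    "items": [
      {
        "name": "sonar-project.properties",
        "path": "config/sonar-project.properties",
        "html_url": "https://github.com/example/repo/blob/master/config/sonar-project.properties",
        "repository": { "full_name": "example/repo" }
      }
    ]
  }
JSON

results = JSON.parse(body)
results["items"].each do |item|
  puts "#{item['repository']['full_name']}: #{item['path']}"
end
```

With a real call, a.parsed_response gives you the same kind of hash directly, and a.code lets you check for rate limiting or auth failures before you touch the body.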

You also have a typo (usarname instead of username), which I've fixed in my version.

Edit: If you want to retrieve the specific matched text, file, and lines you still don't have to scrape their HTML. Instead, you can set your Accept header to application/vnd.github.v3.text-match+json:

url = "https://api.github.com/search/code"
query = {q: "https://sonarcloud.io/dashboard"}
auth = {username: "username", password: "password"}
headers = {"Accept" => "application/vnd.github.v3.text-match+json"}

a = HTTParty.get(url, query: query, basic_auth: auth, headers: headers)

The response should now give a text_matches key containing hashes with fragments showing the matched text as well as object_types (e.g. "FileContent"), object_urls, and indices.
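To make that concrete, here's a sketch of digging the matched fragments out of one item. The item here is a hand-built sample following GitHub's documented text-match metadata shape; the path, URLs, and indices are invented for illustration:

```ruby
require "json"

# Hypothetical trimmed item from a text-match response.
# Shape per GitHub's text-match metadata docs; values are made up.
item = JSON.parse(<<~JSON)
  {
    "path": "README.md",
    "text_matches": [
      {
        "object_url": "https://api.github.com/repositories/123/contents/README.md",
        "object_type": "FileContent",
        "property": "content",
        "fragment": "badge: https://sonarcloud.io/dashboard?id=example",
        "matches": [
          { "text": "https://sonarcloud.io/dashboard", "indices": [7, 38] }
        ]
      }
    ]
  }
JSON

hits = []
item["text_matches"].each do |tm|
  tm["matches"].each do |m|
    hits << [m["text"], m["indices"]]
    puts "#{item['path']}: matched #{m['text'].inspect} at #{m['indices'].inspect}"
  end
end
```

Each fragment is a short excerpt of the file around the match, and indices locate the matched text within that fragment, which is exactly the "content" information you'd otherwise try to scrape out of the HTML.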

This is also mentioned in the search code link I already provided:

When searching for code, you can get text match metadata for the file content and file path fields when you pass the text-match media type. For more details about how to receive highlighted search results, see Text match metadata.

ChrisGPT was on strike
  • The typo was when I was writing here. I understand that scrape can be messy, but is the only way to retrive the informations that i need. Github API, dont return the content informations, just links. – Guilherme Freitas May 10 '19 at 15:39
  • I don't know what that edit is talking about, but this answer still applies. Even if you decide to scrape (again, avoid this if at all possible; it probably is) you need to pass your `auth` hash into the `HTTParty.get()` call. – ChrisGPT was on strike May 10 '19 at 15:58
  • I did just like you told and still don't work. The Scrape returns a sign in page. – Guilherme Freitas May 10 '19 at 16:05
  • @GuilhermeFreitas, you _really_ should be using the API. I think I understand what you were saying in your edit. Please see if my new answer works. Make sure to use the API URL, not the regular search URL. – ChrisGPT was on strike May 10 '19 at 16:49
  • API dont work for what I'm trying to do. This link https://developer.github.com/v3/repos/contents/ shows the return of Github API, and the content that I'm trying to get is encoded. – Guilherme Freitas May 10 '19 at 22:11
  • And no, still not authenticating. – Guilherme Freitas May 10 '19 at 22:13
  • @GuilhermeFreitas, then _please **clearly explain** what you want to do_. Your question is _unclear_. Please respond to my question above about what "that information" is, and how it isn't included in the API response. See [ask]. – ChrisGPT was on strike May 11 '19 at 00:09
  • @GuilhermeFreitas, also, that link you provide is for a completely different API endpoint than the one I'm suggesting. Why are you talking about the repository contents API? There's a search API that I recommended in my original answer four days ago. – ChrisGPT was on strike May 13 '19 at 18:43