Revised (clarified question)

I've spent a few days already trying to figure out how to scrape specific information from a facebook game; however, I've run into brick wall after brick wall. As best as I can tell, the main problem is as follows. I can use Chrome's inspect element tool to manually find the html that I need - it appears nestled inside an iframe. However, when I try and scrape that iframe, it is empty (except for properties):

<iframe id="game_frame" name="game_frame" src="" scrolling="no" ...></iframe>

This is the same output that I see if I use a browser's "View page source" tool. I don't understand why I can't see the data in the iframe. The answer is NOT that it's being added afterwards by AJAX. (I know this both because "View page source" can read data that's been added by AJAX and because I've waited until I can see the data on the page before scraping it, and it's still not there.)

Is this happening because of Facebook's anti-screen-scraping measures, and if so, is there a way around it? Or am I just missing something? I program in Ruby and I've tried nokogiri, then mechanize, then capybara, without success.

I don't know if it makes any difference, but it seems to me that the iframe is getting its data through its "game_frame" name, which is the target of this piece of HTML that appears earlier in the document:

<form id="hidden_login_form_1331840407" action="" method="POST" target="game_frame">
  <input type="hidden" name="signed_request" autocomplete="off" value="v6kIAsKTZa...">
  ...
</form>

Original question

I wrote a Ruby program that uses nokogiri to scrape data from a Facebook game's HTML. Currently, I get the HTML by using Chrome's "inspect element" tool, save it to a file, and parse it from there. However, I would really like to be able to access the information from within Ruby. For example, I would pass the program the page name "www.gamename.com/...?id=12345" and it would log in to Facebook, go to that page, and scrape the data. Currently, if I try that, it doesn't work because I get redirected to Facebook's login page. How can I get past the login screen to access the page(s) I need?

I would like to do this using the nokogiri code that I have already written; however, if I have to, I could rewrite it using something else. Currently, the program is a standalone program, not a Rails program, but I could change that. I've seen some information that might point me in the direction of Omniauth, but I'm not sure that's what I'm looking for and it also looks very complicated. I'm hoping there's a simpler solution.

Thanks

2 Answers

I can recommend capybara-webkit for this kind of task. It uses QtWebkit under the hood and understands Javascript:

require 'capybara-webkit'
require 'capybara/dsl'
require 'nokogiri'

include Capybara::DSL
Capybara.current_driver = :webkit

# login
visit("https://www.facebook.com")
find("#email").set("user")
find("#pass").set("password")
find("#loginbutton input").click  # CSS selector: the input inside #loginbutton

# navigate to the JS-generated page
visit("http://www.gamename.com/...?id=12345")  # absolute URL, including the scheme

# parse HTML
doc = Nokogiri::HTML.parse(body)
Niklas B.
  • Although I could not get webkit to work because of Windows gem-building problems, I was able to use Capybara to get the information I needed. The biggest sticking point was that, because the info I needed was contained within a frame, it did not appear in the HTML for the main page. However, I finally realized that if I used the within_frame method, I would be able to access the info within the frame, and this worked. – Mike Schachter Mar 19 '12 at 21:36
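The within_frame approach mentioned in the comment above could look roughly like the following. This is only a sketch: the helper name frame_html is made up for illustration, the frame id "game_frame" is taken from the question, and it assumes a Capybara session whose driver supports frames (Selenium, for example).

```ruby
# Return the HTML of a frame as a string.
# `session` is expected to be a Capybara::Session (or anything that responds
# to #within_frame and #html in the same way); `frame_id` is the id or name
# of the iframe, e.g. "game_frame" from the question.
def frame_html(session, frame_id)
  html = nil
  # Inside the block, the session is scoped to the frame's document,
  # so #html refers to the frame rather than to the outer page.
  session.within_frame(frame_id) do
    html = session.html
  end
  html
end

# Hypothetical usage, after logging in and visiting the game page:
#   require 'capybara'
#   require 'nokogiri'
#   session = Capybara::Session.new(:selenium)
#   # ... log in and visit the page as in the answer above ...
#   doc = Nokogiri::HTML(frame_html(session, 'game_frame'))
```

The session is passed in as an argument so that the helper does not depend on a particular driver; the existing nokogiri parsing code can then be run on the returned string.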
The easiest is to use mechanize:

require 'mechanize'
@agent = Mechanize.new{|a| a.user_agent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
page = @agent.get 'http://www.facebook.com/'
form = page.forms[0]
form['email'], form['pass'] = 'me@myemail.com', 'foobar'
form.submit
# now you're logged in and a request like this:
doc = @agent.get('http://www.facebook.com/').parser
# gives you a logged in Nokogiri::HTML::Document like you're used to
pguardiario
  • I have used RestClient to do something similar, although with RestClient you have to manage cookies, redirects, etc. yourself. mechanize looks like a good candidate to ease these tasks. – ch4nd4n Mar 14 '12 at 08:34
  • depending on the application, missing JavaScript support could be a showstopper. – Niklas B. Mar 14 '12 at 12:54
  • I tried this and it seems to work great for logging in. Thanks. However, it doesn't seem to be solving my specific problem. The information that I'm looking for is inside a hidden form which is not being read by Nokogiri. Even the standard "view source" web browser option cannot see the contents of the hidden form, only that there is a hidden form. Nokogiri does not even see that. Only Chrome's inspect element tool seems to be able to see the information. I don't know enough to understand what this means or how to deal with this. – Mike Schachter Mar 14 '12 at 14:31
  • @Mike: It means that the contents are created using Javascript, which mechanize does not understand. That's why I proposed a Javascript-aware solution. – Niklas B. Mar 14 '12 at 17:12
  • @mike what I would do is make the request in FF/Chrome to identify the value you're looking for. Then find it in your browser's network panel. When you identify which request has the hidden value just imitate the request with mechanize. You could also try a debugging proxy such as charles or fiddler. – pguardiario Mar 14 '12 at 22:23
  • After spending a while trying to find any kind of info on how to use capybara, I finally found what seems to be some useful info here http://richardconroy.blogspot.com/2010/08/capybara-reference.html. However, when I tried to install capybara-webkit I got "failed to build native extension" errors. I've spent an hour trying different solutions I found online to fix this, with no luck. So I'll have to spend some more time on this and/or look into pguardiario's suggestions. – Mike Schachter Mar 15 '12 at 03:44
  • @mike - There are other browser automation options than capybara. I use watir-webdriver sometimes but as a last resort for performance and portability reasons. The data is there and mechanize will get it for you, you just need to search for it. – pguardiario Mar 15 '12 at 04:23
  • @pguardiario: capybara is not a browser automation option. While it can use Selenium as a backend, it can also use rack-test, Webkit and even mechanize. I agree that if reverse-engineering the AJAX load is possible, this is a good option as well. – Niklas B. Mar 16 '12 at 01:21
  • @Mike: Better read the official docs. capybara-webkit should install without problems on Linux, given that you have the dependencies (Qt4, qmake) installed. On Windows, it seems to be a bit trickier, yet possible. – Niklas B. Mar 16 '12 at 01:24
  • Please take a look at my revised question. It should clarify where the problem is. – Mike Schachter Mar 16 '12 at 16:05
  • @mike - I understand where the problem is, your options are to locate the data you want in the mechanize response or use browser automation to get it (I still say capybara webkit is a browser) – pguardiario Mar 17 '12 at 03:45
  • @pguardiario - I've been trying to follow your suggestions but I've been having a hard time (it certainly doesn't help that installing certain gems on windows can be really difficult sometimes). Can you check out my newest edit and tell me what you think? Also, I don't understand how mechanize can help me at this point since mechanize cannot handle AJAX? Thanks. – Mike Schachter Mar 19 '12 at 05:54
  • @mike - Have you considered hiring somebody to do this for you? I don't have any more advice other than what I've already said. The data is there, you just need to search for it. – pguardiario Mar 19 '12 at 06:54
  • @pguardiario I finally figured it out. Thanks for the help and for insisting that the data was there; I just needed to figure out how to get it. However, I never did find a way to get mechanize to work with the AJAX. – Mike Schachter Mar 19 '12 at 21:37
  • @mike - good, I'm glad you got it working. For the record it is not correct to say that mechanize cannot "handle ajax". An ajax request is the same as any other request and so of course mechanize can handle that. What it can't do is interpret the javascript on a page that makes an ajax call. – pguardiario Mar 19 '12 at 22:37
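The idea from the comments above, that a request seen in the browser's network panel can be imitated by hand, might be sketched as follows using only Ruby's standard library. The helper names hidden_fields and build_form_post and the URL http://game.example/frame are assumptions for illustration; the question's form has an empty action attribute, so the real target URL would have to be read from the network panel. The regex-based extraction is there only to keep the sketch free of dependencies; with nokogiri, selecting input[type=hidden] would be more robust.

```ruby
require 'net/http'
require 'uri'

# Collect name/value pairs from <input type="hidden"> tags in an HTML string.
def hidden_fields(html)
  html.scan(/<input\b[^>]*>/i).each_with_object({}) do |tag, fields|
    next unless tag =~ /\btype="hidden"/i
    name  = tag[/\bname="([^"]*)"/i, 1]
    value = tag[/\bvalue="([^"]*)"/i, 1]
    fields[name] = value.to_s if name
  end
end

# Build (but do not send) the form-encoded POST a browser would issue
# when submitting those fields to action_url.
def build_form_post(action_url, fields)
  request = Net::HTTP::Post.new(URI(action_url))
  request.set_form_data(fields)
  request
end

# Shortened stand-in for the hidden form shown in the question.
html = '<form target="game_frame">' \
       '<input type="hidden" name="signed_request" value="abc123">' \
       '</form>'

fields  = hidden_fields(html)
request = build_form_post('http://game.example/frame', fields) # hypothetical URL

# Sending it would look roughly like this (not executed here):
#   uri = request.uri
#   Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```

With mechanize, which already holds the logged-in cookies, the equivalent step would be a call such as @agent.post(action_url, fields); whether the game accepts a replayed request depends on the site.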