1

One service I'm using doesn't have an API, but allows scraping, so I'm curious what the best way in iOS/Objective-C would be to do the following:

  • Get user login credentials
  • Submit them on the website's login page
  • Grab specific links from the resulting page

How does one get around the fact that the service redirects you to a "Login successful, redirecting..." page before taking you to the content site? (This prevents you from scraping the resulting page immediately.)

For example:

Take a service like Instapaper (or even Twitter). If I wanted to access it without using the API, how would I log the user in, verify that the login succeeded, and scrape the content that follows the "Login successful, redirecting..." page?

Gabriele Petronella
user212541

2 Answers

3

A valid approach would be to perform the scraping inside a UIWebView.

The strategy is straightforward: use UIWebView's stringByEvaluatingJavaScriptFromString: method to control the web page.

Assuming you already have the user's login info, you can enter it with a JavaScript snippet.

For instance, assuming that webView is the UIWebView instance and that 'username' is the id of the username input field:

NSString * usernameScript = @"document.getElementById('username').value='Gabriele';";
[self.webView stringByEvaluatingJavaScriptFromString:usernameScript];

The above code will insert Gabriele in the username field.

Along the same lines, you can automate the rest of the interaction with the web page via JavaScript injection.
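As a minimal sketch of that idea, the helper below fills in both fields and submits the form in one injection. The element ids 'password' and 'loginForm' are assumptions for illustration (only 'username' appears above), and LoginScript and SubmitLogin are hypothetical names; inspect the real login page to find the right ids. The values are passed through NSJSONSerialization so that quotes and backslashes in the credentials don't break the script.

```objective-c
#import <UIKit/UIKit.h>

// Hypothetical helper: builds a script that fills in the login form and submits it.
// The ids 'username', 'password' and 'loginForm' are assumptions about the target page.
// username and password must not be nil.
static NSString *LoginScript(NSString *username, NSString *password) {
    // Serialize the values as a JSON array so they are escaped for use in JavaScript.
    NSData *json = [NSJSONSerialization dataWithJSONObject:@[username, password]
                                                   options:0
                                                     error:NULL];
    NSString *args = [[NSString alloc] initWithData:json encoding:NSUTF8StringEncoding];
    return [NSString stringWithFormat:
            @"(function(c){"
            @"document.getElementById('username').value=c[0];"
            @"document.getElementById('password').value=c[1];"
            @"document.getElementById('loginForm').submit();"
            @"})(%@);", args];
}

// Runs the script in a web view that has already finished loading the login page.
static void SubmitLogin(UIWebView *webView, NSString *username, NSString *password) {
    [webView stringByEvaluatingJavaScriptFromString:LoginScript(username, password)];
}
```

Call SubmitLogin only after the login page has loaded, for example from webViewDidFinishLoad:, otherwise the elements won't exist yet.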

Once you are logged in, you can monitor the current URL until the redirection takes you to the desired page. To do this, implement the webViewDidFinishLoad: method of UIWebViewDelegate, which is called each time the web view finishes loading a page, including the intermediate "redirecting" page:

- (void)webViewDidFinishLoad:(UIWebView *)webView {
    NSURL * currentURL = webView.request.mainDocumentURL;
    // Ignore intermediate pages; scrape only once the target page has loaded.
    if ([currentURL.absoluteString isEqualToString:desiredURLAddress]) {
        [self performScraping];
    }
}

At this point you can perform the actual scraping. Say that you want to get the content of a div tag whose id is foo. That's as simple as doing

- (void)performScraping {
     NSString * fooContentScript = @"document.getElementById('foo').innerHTML;";
     // Evaluate the scraping script (not the username script) and keep its result.
     NSString * fooContent = [self.webView stringByEvaluatingJavaScriptFromString:fooContentScript];
}

This will store the innerHTML content of the div#foo inside the fooContent variable.

Bottom line: by injecting JavaScript into a UIWebView, you can control and scrape any web page.

For extra joy, you can perform all this off screen. To do so, allocate a new UIWindow and add the UIWebView as its subview. If you never make the UIWindow visible, everything described above will happen off screen.

Note that this approach is very effective, but it can be resource-intensive, since you are loading the whole content of each web page. However, this is often a necessary compromise: approaches based on XML parsers are likely to be inadequate, because HTML pages are often malformed and most XML parsers are simply too strict to parse them.
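A rough sketch of that off-screen setup follows. OffscreenScraper is a hypothetical class name and the 320x480 frame is an arbitrary choice; the window is simply never made visible.

```objective-c
#import <UIKit/UIKit.h>

// Hypothetical wrapper that hosts a UIWebView in a window that is never shown.
@interface OffscreenScraper : NSObject <UIWebViewDelegate>
@property (nonatomic, strong) UIWindow *window;
@property (nonatomic, strong) UIWebView *webView;
- (void)loadURL:(NSURL *)url;
@end

@implementation OffscreenScraper

- (instancetype)init {
    if ((self = [super init])) {
        CGRect frame = CGRectMake(0, 0, 320, 480); // arbitrary size
        _window = [[UIWindow alloc] initWithFrame:frame];
        _webView = [[UIWebView alloc] initWithFrame:frame];
        _webView.delegate = self;
        [_window addSubview:_webView];
        // Deliberately no makeKeyAndVisible call: the window stays off screen.
    }
    return self;
}

- (void)loadURL:(NSURL *)url {
    [self.webView loadRequest:[NSURLRequest requestWithURL:url]];
}

- (void)dealloc {
    // UIWebView does not hold its delegate weakly, so clear it before going away.
    _webView.delegate = nil;
}

@end
```

Keep a strong reference to the scraper object for as long as the scraping runs; the delegate methods shown earlier would be implemented in this class.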

Gabriele Petronella
0

There is nothing specific to iOS or Objective-C in what you are trying to do. If you know how to process HTTP responses and how to detect the login page, all you have to do is parse each response and submit the credentials to the login endpoint whenever the response turns out to be the login page. Before you get started, do read the documentation on NSURLConnection.
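As a hedged sketch of what this could look like, the code below posts the credentials with NSURLConnection and hands back the final URL and the HTML. The form field names "username" and "password" and the helper names FormEncode and PostLogin are assumptions for illustration; the real field names come from the login form's markup, and the response is assumed to be UTF-8.

```objective-c
#import <Foundation/Foundation.h>

// Percent-encodes a form value, leaving only RFC 3986 unreserved characters as-is.
static NSString *FormEncode(NSString *value) {
    NSCharacterSet *allowed = [NSCharacterSet characterSetWithCharactersInString:
        @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"];
    return [value stringByAddingPercentEncodingWithAllowedCharacters:allowed];
}

// Hypothetical helper: POSTs the credentials and reports where the request ended up.
static void PostLogin(NSURL *loginURL, NSString *username, NSString *password,
                      void (^completion)(NSURL *finalURL, NSString *html, NSError *error)) {
    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:loginURL];
    request.HTTPMethod = @"POST";
    [request setValue:@"application/x-www-form-urlencoded"
   forHTTPHeaderField:@"Content-Type"];
    // Field names are assumptions; match them to the real login form.
    NSString *body = [NSString stringWithFormat:@"username=%@&password=%@",
                      FormEncode(username), FormEncode(password)];
    request.HTTPBody = [body dataUsingEncoding:NSUTF8StringEncoding];

    [NSURLConnection sendAsynchronousRequest:request
                                       queue:[NSOperationQueue mainQueue]
                           completionHandler:^(NSURLResponse *response, NSData *data, NSError *error) {
        // Assumes a UTF-8 body; html is nil if the request failed or decoding fails.
        NSString *html = data ? [[NSString alloc] initWithData:data
                                                      encoding:NSUTF8StringEncoding] : nil;
        completion(response.URL, html, error);
    }];
}
```

NSURLConnection follows HTTP-level redirects by default, so finalURL reflects where the request ended up. It does not run JavaScript or follow a meta refresh, so if the intermediate page redirects that way, its HTML is what comes back and the target page has to be requested separately; cookies set during login are normally kept in the shared cookie storage for that follow-up request.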

sixthcent
  • My problem is when I submit the credentials, I don't know how to then grab the HTML from the following page that will soon appear, as there's a temp screen in between. – user212541 Apr 12 '13 at 22:49
  • If you received the login page, spawn a separate connection to complete the login before returning to UI. This way, you will be authenticating in the background before moving to the next screen. – sixthcent Apr 12 '13 at 22:54
  • I bring up the log in page myself. They input their credentials and hit login. I then authenticate in the background. How do I know the authentication was successful? I can't check the next page that it brings up (the response page) as it's a "Logging in... one moment" screen. – user212541 Apr 12 '13 at 23:45