
I am working on a project that modifies Apache Nutch. We have already replaced one of Nutch's original modules with our own, built on HtmlUnit. I need to download a whole Facebook user page (e.g. http://www.facebook.com/profile.php?id=100002517096832), which will then be parsed by our own parser. Unfortunately, Facebook uses a mechanism called BigPipe (http://www.facebook.com/note.php?note_id=389414033919), so most of the page's content is hidden inside <!-- --> comment tags. Normally, when you scroll down a Facebook page, new content is unpacked each time you are about to reach the bottom. I tried using JavaScript to scroll my htmlPage (an HtmlPage object from HtmlUnit), but eventually realized that scrolling does not trigger loading of new content on the Facebook user page.

How can I find out which event triggers the loading of new content on a Facebook page? Or should I approach the problem from a different side, for example by extracting the BigPipe "things" on my own? Has anyone ever done that?
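To make the second idea concrete, here is a minimal sketch of pulling comment-wrapped markup out of a page's source. It assumes only what is described above, namely that the hidden content sits inside <!-- --> comments. The class name `CommentExtractor` and the method `extractCommentBodies` are hypothetical names chosen for illustration. The regular expression is a simplification and not a full HTML parser, and nothing here reflects how BigPipe itself delivers or registers its content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the text found inside HTML comments.
final class CommentExtractor {

    // DOTALL lets '.' match newlines, so multi-line comments are captured.
    // The lazy quantifier stops at the first closing "-->".
    private static final Pattern COMMENT =
            Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    private CommentExtractor() {
    }

    // Returns the trimmed body of every non-empty comment, in document order.
    static List<String> extractCommentBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = COMMENT.matcher(html);
        while (matcher.find()) {
            String body = matcher.group(1).trim();
            if (!body.isEmpty()) {
                bodies.add(body);
            }
        }
        return bodies;
    }
}
```

The input string could be the serialized page source, for example the result of `htmlPage.asXml()`, assuming the serialized output still contains the comments. Each returned fragment could then be handed to the custom parser as ordinary markup.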

Brian Tompsett - 汤莱恩
igleyy

1 Answer


Before getting to your question … what kind of project are you trying to build there?

Since Apache Nutch is open source web-search software, I assume you are trying to build some kind of search engine that scrapes Facebook user profiles/feeds to get data and make it searchable on some third-party website?

Well, that would be a violation of the Facebook Platform Policies:

I. Features and Functionality

12. You must not include data obtained from us in any search engine or directory without our written permission.

So, do you have that written permission?

Community
CBroe
    Facebook is just an example of a website that relies heavily on AJAX, and that is the only reason I am using it. I am not going to use any data that I can get from Facebook. This kind of mechanism is often used on other websites, for example: http://imgur.com/gallery – igleyy Sep 18 '12 at 19:26