I am working on a project that modifies Apache Nutch. We have already replaced Nutch's original module with our own, built on HtmlUnit. I need to download an entire Facebook user profile page (e.g. http://www.facebook.com/profile.php?id=100002517096832), which will then be parsed by our own parser. Unfortunately, Facebook uses a mechanism called BigPipe (http://www.facebook.com/note.php?note_id=389414033919), which is why most of the page's content is hidden inside <!-- --> comment tags.
Normally, when you scroll down a Facebook page, new content is loaded each time you are about to reach the bottom. I tried using JavaScript to scroll my htmlPage (an HtmlPage object from the HtmlUnit project), but I eventually realized that scrolling does not trigger loading of new content on the Facebook user page.
How can I find out which event triggers content loading on a Facebook page? Or should I approach the problem from a different angle, for example by extracting the BigPipe content myself? Has anyone done this before?
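
To make the second idea concrete, here is a minimal sketch of what I mean by extracting the hidden content myself. It assumes the hidden markup sits in ordinary HTML comments in the raw page source (for instance the string from htmlPage.getWebResponse().getContentAsString()). The class name BigPipeCommentExtractor, the method extractCommentBodies, and the sample markup are made up for illustration, and I have not verified that this matches what Facebook actually serves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the bodies out of HTML comments in raw page source.
public class BigPipeCommentExtractor {

    // Captures the text between <!-- and -->; DOTALL lets a comment span several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    public static List<String> extractCommentBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher matcher = COMMENT.matcher(html);
        while (matcher.find()) {
            String body = matcher.group(1).trim();
            // Skip empty comments; they carry no hidden markup.
            if (!body.isEmpty()) {
                bodies.add(body);
            }
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Assumed sample input, not real Facebook output.
        String sample = "<code><!-- <div>Wall post</div> --></code>";
        for (String body : extractCommentBodies(sample)) {
            System.out.println(body);
        }
    }
}
```

Each extracted string could then be handed to our parser as an HTML fragment. This only recovers content already present in the response, so anything loaded later on scroll would still be missing.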