I've been debugging this for a long time and it has me completely baffled. I need to save ads to my computer for a work project. Here is an example ad that I got from CNN.com:

http://ads.cnn.com/html.ng/site=cnn&cnn_pagetype=main&cnn_position=300x250_rgt&cnn_rollup=homepage&page.allowcompete=no&params.styles=fs&Params.User.UserID=5372450203c5be0a3c695e599b05d821&transactionID=13999976982075532128681984&tile=2897967999935&domId=6f4501668a5e9d58&kxid=&kxseg=

When I visit this link in Google Chrome and Firefox, I see an ad (if the link stops working, simply go to CNN.com and grab the iframe URL for one of the ads). I developed a PhantomJS script that will save a screenshot and the HTML of any page. It works on any website, but it doesn't seem to work on these ads. The screenshot is blank and the HTML contains a tracking pixel (a 1x1 transparent gif used to track the ad). I thought that it would give me what I see in my normal browser.

The only thing that I can think of is that the AJAX calls are somehow messing up PhantomJS, so I hard-coded a delay but I got the same results.

Here is the most basic piece of test code that reproduces my problem:

var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the address!');
        phantom.exit();
    }
    else {
        // Output Results Immediately
        var html = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        fs.write("HtmlBeforeTimeout.htm", html, 'w');
        page.render('RenderBeforeTimeout.png');

        // Output Results After Delay (for AJAX)
        window.setTimeout(function () {
            var html = page.evaluate(function () {
                return document.getElementsByTagName('html')[0].innerHTML;
            });
            fs.write("HtmlAfterTimeout.htm", html, 'w');
            page.render('RenderAfterTimeout.png');
            phantom.exit();
        }, 9000); // 9 Second Delay 
    }
});

You can run this code using this command in your terminal:

phantomjs getHtml.js 'http://www.google.com/'

The above command works well. When you replace the Google URL with an ad URL (like the one at the top of this post), it gives me the unexpected results that I described above.

Thanks so much for your help! This is my first question that I've ever posted on here, because I can almost always find the answer by searching Stack Overflow. This one, however, has me completely stumped! :)

EDIT: I'm running PhantomJS 1.9.7 on Ubuntu 14.04 (Trusty Tahr)

EDIT: Okay, I've been working on it for a while now and I think it has something to do with cookies. If I clear all of my history and view the link in my browser, it also comes up blank. If I then refresh the page, it displays fine. It also displays fine if I open it in a new tab. The only time it doesn't is when I try to view it directly after clearing my cookies.

EDIT: I've tried loading the link twice in PhantomJS without exiting (manually requesting it twice in my script before calling phantom.exit()). It doesn't work. In the PhantomJS documentation it says that the cookie jar is enabled by default. Any ideas? :)
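For anyone who wants to repeat the "load it twice" experiment, here is a minimal sketch of one way to do it, assuming the cookie theory above is right. The helper name `openRepeatedly` and the output file name are made up for illustration. It reuses a single page object so that any cookies set by the first response stay in the jar for the second request, then prints how many cookies `phantom.cookies` reports. It has not been verified against the CNN ad server and is not a confirmed fix.

```javascript
// Open the same URL `times` times in ONE page object, one load after another,
// then hand the list of load statuses to `done`.
// `page` only needs an open(url, callback) method, as PhantomJS pages have.
function openRepeatedly(page, url, times, done) {
    var statuses = [];
    function next() {
        if (statuses.length === times) {
            done(statuses);
            return;
        }
        page.open(url, function (status) {
            statuses.push(status);
            next();
        });
    }
    next();
}

// The part below only runs under PhantomJS; the guard lets the helper
// be loaded in other JavaScript environments without errors.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = require('system').args[1];

    openRepeatedly(page, url, 2, function (statuses) {
        console.log('Load statuses: ' + statuses.join(', '));
        // phantom.cookies lists what is currently stored in the cookie jar.
        console.log('Cookies in jar: ' + phantom.cookies.length);
        page.render('RenderSecondLoad.png'); // hypothetical file name
        phantom.exit();
    });
}
```

If the cookie count stays at zero after both loads, the ad server's cookies are probably not being stored, which would point away from a simple timing problem.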

  • This is quite a stumper! No matter what I try, I only get the 1x1 black image. I wonder if it has anything to do with the fact that some ads run in an embedded Flash player? PhantomJS has not supported Flash since version 1.5, which allows it to run completely headless without the need for xvfb. One thing worth trying is SlimerJS, which uses the Gecko engine rather than WebKit, supports Flash, and has virtually the same API as PhantomJS. – Cameron Tinker May 15 '14 at 14:30

1 Answer

You should try using the onLoadFinished callback instead of checking for status in page.open. Something like this should work:

var fs = require('fs');
var page = require('webpage').create();
var url = phantom.args[0];

// Register the handler before starting the load so it is in place when the page finishes.
page.onLoadFinished = function()
{
    // Output Results Immediately
    var html = page.evaluate(function () {
        return document.getElementsByTagName('html')[0].innerHTML;
    });
    fs.write("HtmlBeforeTimeout.htm", html, 'w');
    page.render('RenderBeforeTimeout.png');

    // Output Results After Delay (for AJAX)
    window.setTimeout(function () {
        var html = page.evaluate(function () {
            return document.getElementsByTagName('html')[0].innerHTML;
        });
        fs.write("HtmlAfterTimeout.htm", html, 'w');
        page.render('RenderAfterTimeout.png');
        phantom.exit();
    }, 9000); // 9 Second Delay
};

page.open(url);

I have an answer here that loops through all files in a local folder and saves images of the resulting pages: "Using Phantom JS to convert all HTML files in a folder to PNG". The same principle applies to remote HTML pages.

Here is what I have from the output:
Before Timeout:
https://i.stack.imgur.com/GmsH9.jpg

After Timeout:
https://i.stack.imgur.com/mo6Ax.jpg

Cameron Tinker
  • Seems reasonable, especially since all you are loading to begin with is a bunch of JS that renders an iframe later; the timing becomes important to what is being rendered. – dbrin May 13 '14 at 18:25
  • I was just moving the OP's original code to the onLoadFinished callback. The 9 second delay is as the OP had it. – Cameron Tinker May 13 '14 at 18:28
  • sorry the comment was for the OP. I was agreeing with you :) – dbrin May 13 '14 at 18:30
  • Thanks for your quick reply! I copied and pasted your code into a new file and tested it. The output was exactly the same as mine (it worked fine with the Google link but didn't work with the Ad from CNN). Do you have any other ideas? :) – Jared Carter May 13 '14 at 18:39
  • I've added the output from my run. CNN is quite long so I apologize for the long post haha. I can edit the answer to link to the images to reduce the size of the answer. – Cameron Tinker May 13 '14 at 18:50
  • I just saw your screenshots from the CNN.com home page. Could you please try this Ad from CNN instead: http://ads.cnn.com/html.ng/site=cnn&cnn_pagetype=main&cnn_position=300x250_rgt&cnn_rollup=homepage&page.allowcompete=no&params.styles=fs&Params.User.UserID=5372450203c5be0a3c695e599b05d821&transactionID=13999976982075532128681984&tile=2897967999935&domId=6f4501668a5e9d58&kxid=&kxseg= It comes from the iFrame on the right side of the page. The isolated ad works in Chrome but I cannot seem to get it to render in PhantomJS. – Jared Carter May 13 '14 at 18:51
  • I'm coming up blank here too (see what I did there?). Where did you get the Ad link from? My guess is that you're not sending all the required HTTP headers/POST data to the Ad url. – Cameron Tinker May 13 '14 at 18:59
  • Just to clarify, this is the terminal command that doesn't output anything using the code from either of our posts: phantomjs getHtml.js 'http://ads.cnn.com/html.ng/site=cnn&cnn_pagetype=main&cnn_position=300x250_rgt&cnn_rollup=homepage&page.allowcompete=no&params.styles=fs&Params.User.UserID=5372450203c5be0a3c695e599b05d821&transactionID=13999976982075532128681984&tile=2897967999935&domId=6f4501668a5e9d58&kxid=&kxseg=' – Jared Carter May 13 '14 at 18:59
  • Haha! The link is from the iframe on the right side of the CNN.com home page (it doesn't appear in the screenshot but it does appear when you visit it in a browser). In the .htm file created by PhantomJS, search for "Advertisement" and look a few lines above it. There is an iframe tag with the link in the src attribute. – Jared Carter May 13 '14 at 19:02
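Two ideas from these comments can be sketched together: pulling the ad URL out of the iframe `src` attribute in the saved HTML, and requesting it with headers closer to what a browser would send from the parent page. This is only a rough sketch under several assumptions. The helper name `extractIframeSrcs`, the output file name, and the choice of a `Referer` header are illustrative guesses, not something confirmed to make the ad render. `page.content` and `page.customHeaders` are standard PhantomJS page properties.

```javascript
// Collect the src attribute of every <iframe> tag in an HTML string.
// A regular expression is enough for a quick look; it is not a full HTML parser.
function extractIframeSrcs(html) {
    var srcs = [];
    var re = /<iframe\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/gi;
    var m;
    while ((m = re.exec(html)) !== null) {
        // Serialized HTML escapes "&" inside attributes as "&amp;".
        srcs.push(m[2].replace(/&amp;/g, '&'));
    }
    return srcs;
}

// The part below only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var parentUrl = require('system').args[1]; // the page that hosts the ad
    var page = require('webpage').create();

    page.open(parentUrl, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit(1);
            return;
        }
        var srcs = extractIframeSrcs(page.content);
        console.log('iframe URLs found:\n' + srcs.join('\n'));
        if (srcs.length === 0) {
            phantom.exit();
            return;
        }

        // Assumption: the ad server may expect a Referer from the parent page.
        var adPage = require('webpage').create();
        adPage.customHeaders = { 'Referer': parentUrl };
        adPage.open(srcs[0], function (adStatus) {
            console.log('Ad frame load: ' + adStatus);
            // Short delay so scripts inside the ad frame can run first.
            window.setTimeout(function () {
                adPage.render('AdFrame.png'); // hypothetical file name
                phantom.exit();
            }, 3000);
        });
    });
}
```

Note that iframes injected by scripts after load, or with relative `src` values, would need extra handling that this sketch leaves out.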