I am working on a Java project that operates on the DOM. I first load the dynamic page at a given URL using Selenium, then parse the result with Jsoup.

I want to get the dynamic page source of a given URL.

Code snapshot:

public static void main(String[] args) throws IOException {

     // Selenium
     WebDriver driver = new FirefoxDriver();
     driver.get("ANY URL HERE");  
     String html_content = driver.getPageSource();
     driver.close();

     // Jsoup makes DOM here by parsing HTML content
     Document doc = Jsoup.parse(html_content);

     // OPERATIONS USING DOM TREE
}

The problem is that Selenium takes around 95% of the total processing time, which is undesirable.

Selenium first opens Firefox, then loads the given page, then gets the dynamic page source code.

Can you tell me how I can reduce the time taken by Selenium, or which more efficient tool could replace it? Any other advice would also be welcome.

Edit No. 1

Some code is given at this link:

FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("general.useragent.override", "some UA string");
WebDriver driver = new FirefoxDriver(profile);

But I don't understand what the second line does here, and Selenium's documentation is also very poor.

Edit No. 2

    // printf is needed for the %s placeholder; println would print it literally
    System.out.printf("Fetching %s...%n", url1);
    System.out.printf("Fetching %s...%n", url2);

    WebDriver driver = new FirefoxDriver(createFirefoxProfile());

    // Pass the URL variables, not the string literals "url1" / "url2"
    driver.get(url1);
    String html1 = driver.getPageSource();

    driver.get(url2);
    String html2 = driver.getPageSource();
    driver.close();

    Document doc1 = Jsoup.parse(html1);
    Document doc2 = Jsoup.parse(html2);
devsda

2 Answers


Try this:

public static void main(String[] args) throws IOException {

    // Selenium
    WebDriver driver = new FirefoxDriver(createFirefoxProfile());
    driver.get("ANY URL HERE");
    String html_content = driver.getPageSource();
    driver.close();

    // Jsoup makes DOM here by parsing HTML content
    // OPERATIONS USING DOM TREE
}

private static FirefoxProfile createFirefoxProfile() {
    File profileDir = new File("/tmp/firefox-profile-dir");
    if (profileDir.exists())
        return new FirefoxProfile(profileDir);
    FirefoxProfile firefoxProfile = new FirefoxProfile();
    File dir = firefoxProfile.layoutOnDisk();
    try {
        profileDir.mkdirs();
        FileUtils.copyDirectory(dir, profileDir);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return firefoxProfile;
}

The createFirefoxProfile() method creates a profile directory if one doesn't exist, and reuses it if it already does. That way Selenium doesn't need to create the profile directory structure every time.
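
One way to check whether reusing the profile actually helps is to time each phase separately. The following is a minimal sketch using only the JDK. `PhaseTimer` and the phase names are hypothetical, and the lambdas in `main` are stand-ins for the real `driver.get(...)` / `driver.getPageSource()` and `Jsoup.parse(...)` calls, which are assumed to be substituted in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: accumulates wall-clock time per named phase.
public class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    // Runs the work, records how long it took under the given phase name,
    // and returns the work's result.
    public <T> T time(String phase, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    // Total time recorded across all phases, in nanoseconds.
    public long totalNanos() {
        long total = 0;
        for (long n : nanosByPhase.values()) {
            total += n;
        }
        return total;
    }

    // Fraction (0.0 to 1.0) of the total time spent in one phase.
    public double share(String phase) {
        long total = totalNanos();
        return total == 0 ? 0.0 : (double) nanosByPhase.getOrDefault(phase, 0L) / total;
    }

    // One line per phase: name, milliseconds, percentage of total.
    public String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByPhase.entrySet()) {
            sb.append(String.format("%-12s %10.3f ms %6.1f%%%n",
                    e.getKey(), e.getValue() / 1_000_000.0, 100.0 * share(e.getKey())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        // Stand-in for: driver.get(url); driver.getPageSource();
        String html = timer.time("load page", () -> "<html><body><p>hi</p></body></html>");
        // Stand-in for: Jsoup.parse(html)
        int length = timer.time("parse", () -> html.length());
        System.out.println("Parsed " + length + " characters");
        System.out.print(timer.report());
    }
}
```

Running the same measurement with and without the cached profile directory shows whether driver start-up or page loading dominates.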

Dakshinamurthy Karra
  • Thanks. Wait, I will add this module and check how it affects my project. – devsda Apr 05 '13 at 10:17
  • `FileUtils.copyDirectory(dir, profileDir);`: NetBeans suggests creating a class FileUtils. I think there is some mistake. Please see. – devsda Apr 05 '13 at 10:21
  • It is from apache commons. Selenium also uses it - so add the jar to your project. – Dakshinamurthy Karra Apr 05 '13 at 10:24
  • I added this dependency: `<dependency><groupId>org.apache.commons</groupId><artifactId>commons-io</artifactId><version>1.3.2</version></dependency>`. But it throws the same error. – devsda Apr 05 '13 at 10:30
  • I am on eclipse. The JAR I added is commons-io-2.2.jar. – Dakshinamurthy Karra Apr 05 '13 at 10:32
  • Yes, it worked in NetBeans too, but Maven shows an error. – devsda Apr 05 '13 at 10:33
  • Can you tell me how this function helps in my case? – devsda Apr 05 '13 at 10:35
  • Time it and see whether it helps. When you create a firefox driver - selenium creates a new profile for the firefox instance. Each time when you create a webdriver, the profile is recreated. This function avoids creating a firefox profile each and every time. – Dakshinamurthy Karra Apr 05 '13 at 10:37
  • Thanks a lot. I am trying to run this on maven. If I find any problem, I will inform you. Thanks again. – devsda Apr 05 '13 at 10:39
  • Now it works with Maven too. My algorithm has to open two URLs. It performs fine: I get the first URL, then the other one, and it is faster than before. Can it be made more efficient? See my code in Edit No. 2. – devsda Apr 05 '13 at 11:13
  • I followed all of your instructions. It takes around `13 - 18 seconds` to get the dynamic page source of two URLs, but I want this task to take around 1 - 2 seconds, or some similarly short time. How can I get the dynamic pages of two URLs efficiently? Please see my main problem; this question is part of making that algorithm efficient. http://stackoverflow.com/questions/15718235/optimized-algorithm-to-compare-templates-of-two-urls – devsda Apr 05 '13 at 12:30
  • If you have some time, can we discuss my problem via chat, please? – devsda Apr 05 '13 at 12:30
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/27644/discussion-between-kdm-and-jhamb) – Dakshinamurthy Karra Apr 05 '13 at 14:27
  • Can we discuss this in chat, please? I need your expert advice. Please spare some time. – devsda Apr 06 '13 at 07:46
  • I want to use HtmlUnit to get the dynamic page source, because Selenium takes around 13 - 18 seconds to get the page source of two URLs. I tried to write code using HtmlUnit, but it is not working. Do you know any good tutorial for this, or can you help me write the code? Please give some guidance. – devsda Apr 06 '13 at 10:43
  • I followed your instructions, and perform the same thing using GhostDriver + PhantomJs, but there is not much difference in the time occurs. What can I do now? Here is my code, which has one error; please see the code. – http://stackoverflow.com/questions/15852687/why-code-not-exits-at-the-end-after-closing-the-driver-ghostdriver-phantomjs – devsda Apr 06 '13 at 19:45
  • My code is ready, and it ran fine for two URLs, but when I test it with a larger number of URLs, it fails. Can you please help me? The question link is http://stackoverflow.com/questions/16075837/shows-exception-in-java-code-selenium-jsoup – devsda Apr 18 '13 at 07:31

If you are confident about your code, you can go with PhantomJS. It is a headless browser and will return your results quickly. Firefox will take longer to execute.

divine
  • This late answer may be of limited use. One of devnull's comments, on 6 April 2013, was: "I followed your instructions, and perform the same thing using GhostDriver + PhantomJs, but there is not much difference in the time occurs" – aberna Feb 23 '15 at 14:18