
In my trial, I hit a Splash instance with 50 parallel threads, each of which fetches the page source of one URL. My Splash instance's default slots value is 50. The fetch time grows sharply with the number of parallel threads: I do get correct HTML for all 50 URLs, but the time per URL rises from about 2 seconds for the 1st URL to about 45 seconds for the 50th. How can I reduce the time needed to fetch the page source?

My sample Java code is:

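import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Properties;

// HttpClientUtil, ScrapyUtil and HttpResponse come from my project's HTTP dependencies (imports omitted).
public class SplashThread implements Runnable {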

private static final String splash ="http://localhost:8050/execute";

// Image and CSS filtering is configured before splash:go so that it applies to the page load.
private String script = URLEncoder.encode("function main(splash, args)\n" +
        "  splash.images_enabled = false\n" +
        "  splash:on_request(function(request)\n" +
        "    if string.find(request.url, \".css\", 1, true) ~= nil then\n" +
        "      request:abort()\n" +
        "    end\n" +
        "  end)\n" +
        "  splash:go(args.url)\n" +
        "  local html = splash:html()\n" +
        "  return html\n" +
        "end", "UTF-8");

private String url =null;

public SplashThread(String url) throws UnsupportedEncodingException {
    this.url = url;

}
@Override
public void run() {
    HttpClientUtil clientUtil = null;
    try {
        Properties queryParms = new Properties();
        queryParms.put("url", url);
        queryParms.put("timeout", "85.0");
        queryParms.put("lua_source", script);

        clientUtil = new HttpClientUtil(-1, false);
        HttpResponse response = clientUtil.doGet(splash, queryParms, null);
        // resp holds the rendered HTML returned by the Lua script.
        String resp = ScrapyUtil.getResponseString(response, "UTF-8");
    }
    catch (Exception e) {
        e.printStackTrace();
        System.out.println("Failed to fetch :: " + url);
    }
    finally {
        if(clientUtil!=null){
            try {
                clientUtil.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

    }

}

}

I am scheduling 50 threads of this runnable object with ScheduledExecutorService.
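
For reference, here is a minimal sketch of how the per-request times could be measured. It is not the exact harness from my project: `LatencyProbe`, `measure` and `fakeFetch` are hypothetical names, a plain fixed thread pool stands in for the ScheduledExecutorService, and a short `Thread.sleep` stands in for the real Splash call so the block runs on its own. Each task is timed from submission to completion, so any time spent waiting in a queue is included.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical harness, not part of the original code: times each task from
// submission to completion so queueing delay shows up in the numbers.
class LatencyProbe {

    // Runs `tasks` copies of `work` on a pool of `poolSize` threads and
    // returns the elapsed milliseconds per task, in submission order.
    static List<Long> measure(int tasks, int poolSize, Runnable work) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final long submitted = System.nanoTime();
                Callable<Long> timed = () -> {
                    work.run();
                    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - submitted);
                };
                futures.add(pool.submit(timed));
            }
            List<Long> elapsed = new ArrayList<>();
            for (Future<Long> f : futures) {
                elapsed.add(f.get()); // blocks until that task has finished
            }
            return elapsed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for one Splash request (assumption); in the real test this
        // would be the run() method of a SplashThread instance.
        Runnable fakeFetch = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        List<Long> elapsed = measure(50, 50, fakeFetch);
        for (int i = 0; i < elapsed.size(); i++) {
            System.out.println("task " + (i + 1) + ": " + elapsed.get(i) + " ms");
        }
    }
}
```

Running `measure` with the same number of tasks but different `poolSize` values (for example 1, 10 and 50) should show whether the slowdown tracks the level of concurrency or something else.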

If I fetch the page sources one by one, it works perfectly, but I need to fetch them concurrently.
