
I've been using crawler4j for a few months now. I recently started noticing that it hangs on some sites and never returns. The recommended solution is to set resumable to true, but that is not an option for me because I am limited on disk space. I ran multiple tests and noticed that the hang is very random: it crawls between 90 and 140 URLs and then stops. I thought maybe it was the site, but there is nothing suspicious in the site's robots.txt, and all pages respond with 200 OK. I know the crawler hasn't crawled the entire site, otherwise it would shut down. What could be causing this, and where should I start?

What's interesting is that I start the crawlers with the non-blocking call, followed by a while loop that checks status:

controller.startNonBlocking(CrawlProcess.class, numberOfCrawlers);

while(true){
  System.out.println("While looping");
}

When the crawler hangs, the while loop also stops responding, but the thread is still alive. That means the entire process is unresponsive, so I am unable to send a shutdown command.
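As an aside, that tight while(true) loop busy-spins a CPU core; a sleep-based poll is gentler and gives you a natural place to call the controller's shutdown. A minimal sketch of the pattern, with a plain AtomicBoolean standing in for the controller's finished check (crawler4j itself is not required to show the idea):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class Monitor {

    // Poll a status flag at fixed intervals instead of spinning.
    // In the real crawl, the flag check would be replaced by something
    // like controller.isFinished(), and the caller could invoke
    // controller.shutdown() when it decides to stop waiting.
    static int pollUntil(AtomicBoolean done, long intervalMillis, int maxPolls) {
        int polls = 0;
        while (!done.get() && polls < maxPolls) {
            try {
                Thread.sleep(intervalMillis); // yield the CPU between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            polls++;
        }
        return polls; // how many sleeps happened before we gave up or finished
    }
}
```

The maxPolls cap is the useful part here: it turns "wait forever" into "wait this long, then force a shutdown or take a thread dump."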

UPDATE: I figured out what is causing the hang. I run a store-to-MySQL step in the visit method. The step looks like this:

public void insertToTable(String dbTable, String url2, String cleanFileName, String dmn, String AID, 
        String TID, String LID, String att, String ttl, String type, String lbl, String QL,
        String referrer, String DID, String fp_type, String ipAddress, String aT, String sNmbr) throws SQLException, InstantiationException, IllegalAccessException, ClassNotFoundException{
    try{
        String strdmn = "";
        if(dmn.contains("www")){
            strdmn = dmn.replace("http://www.","");
        }else{
            strdmn = dmn.replace("http://","");
        }
        String query = "INSERT INTO "+dbTable
                +" (url,filename, dmn, AID, TID, LID, att, ttl, type, lbl, tracklist, referrer, DID, searchtype, description, fp_type, ipaddress," +
                " aT, sNmbr, URL_Hash, iteration)VALUES('"
                +url2+"','"+cleanFileName+"','"+strdmn+"','"+AID+"','"+TID+"','"+LID+"','"+att+"','"+ttl+"','"+type+"'" +
                ",'"+lbl+"','"+QL+"','"+dmn+"','"+DID+"','spider','"+cleanFileName+"','"+fp_type+"'," +
                "'"+ipAddress+"','"+aT+"','"+sNmbr+"',MD5('"+url2+"'), 1) ON DUPLICATE KEY UPDATE iteration = iteration + 1";
        Statement st2 = null;
        con = DbConfig.openCons();
        st2 = con.createStatement();
        st2.executeUpdate(query);
        //st2.execute("SELECT NOW()");
        st2.close();
        con.close();
        if(con.isClosed()){
            System.out.println("CON is CLOSED");
        }else{
            System.out.println("CON is OPEN");
        }
        if(st2.isClosed()){
            System.out.println("ST is CLOSED");
        }else{
            System.out.println("ST is OPEN");
        }
    }catch(NullPointerException npe){
        System.out.println("NPE: " + npe);
    }
}

What's very interesting is that when I run st2.execute("SELECT NOW()") instead of the current st2.executeUpdate(query), it works fine and crawls the site without hanging. But for some reason st2.executeUpdate(query) causes it to hang after a few queries. It's not MySQL, because it doesn't throw any exceptions. I thought maybe I was hitting a "too many connections" limit in MySQL, but that isn't the case. Does my process make sense to anyone?
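For what it's worth, building that query by string concatenation is fragile (a single quote in a title or URL breaks the statement) and is an SQL injection risk. Below is a hedged sketch of the same insert using a PreparedStatement, abbreviated to a few of the question's columns; the table name cannot be a bound parameter, so it still has to come from a trusted value:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SafeInsert {

    // Same intent as the dmn.replace(...) calls in the question:
    // strip the scheme and an optional leading "www.".
    static String stripScheme(String dmn) {
        return dmn.replaceFirst("^https?://(www\\.)?", "");
    }

    // Abbreviated version of insertToTable using bound parameters.
    // Only a few of the original columns are shown; the rest would
    // follow the same setString(...) pattern.
    static void insertToTable(Connection con, String dbTable,
                              String url2, String dmn) throws SQLException {
        String query = "INSERT INTO " + dbTable
                + " (url, dmn, URL_Hash, iteration) VALUES (?, ?, MD5(?), 1)"
                + " ON DUPLICATE KEY UPDATE iteration = iteration + 1";
        try (PreparedStatement ps = con.prepareStatement(query)) {
            ps.setString(1, url2);
            ps.setString(2, stripScheme(dmn));
            ps.setString(3, url2);
            ps.executeUpdate();
        } // statement is closed here even if executeUpdate throws
    }
}
```

The try-with-resources block also closes the statement on every path, which matters for the hang described below.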

  • Debugger, or, thread dump. – djechlin Jul 18 '14 at 16:48
  • What can I use to debug other than eclipse? I use MAT to get heap dump but that doesn't give me anything useful. What's odd is it doesn't hang for all domains. For some domains it crawls the entire site without a problem. – Salim Jul 18 '14 at 16:51
  • 1
    Thread dump will tell you where in crawler4j it's hanging. – djechlin Jul 18 '14 at 16:52
  • ah, ok. I will start with that. Is a MAT dump the same as thread dump? How do you recommend getting a thread dump? – Salim Jul 18 '14 at 17:12
  • I don't know what a MAT dump is. Visual VM or send kill -3 to the process; it will dump it to stderr. – djechlin Jul 18 '14 at 17:15
  • is kill -3 in linux or java System.exit()? I don't see -3 as an option in linux. – Salim Jul 18 '14 at 18:39
  • dunno. it works. google "java kill -3" – djechlin Jul 18 '14 at 19:38
  • thank you for the guidance. It turned out to be c3p0 that was not responding because of a connection leak. should i delete this question? – Salim Jul 21 '14 at 17:15
  • 1
    I'm leaning toward keeping it, since other people using crawler4j could have this problem. – djechlin Jul 21 '14 at 17:39
  • 1
    This is an *extremely* common cause of a variety of related problems. Try googling " hanging randomly" and you'll get the same. but by that token it's still useful if someone has your problem with crawler4j, doesn't already know to check for blocked connections as a debugging step, then they'll end up here (and that's all likely) – djechlin Jul 21 '14 at 17:40

1 Answer


The importance of a finally block.

crawler4j was using c3p0 pooling to insert into MySQL, and after a few queries the crawler would stop responding. Thanks to @djechlin's advice, it turned out to be a connection leak in c3p0. I added a finally block like the one below and it works great now!

try{
   //the insert method is here
}catch(SQLException e){
  e.printStackTrace();
}finally{
  // close in reverse order of creation; each close() can itself
  // throw SQLException, so guard them individually
  try{ if(rs != null) rs.close(); }catch(SQLException ignore){}
  try{ if(st != null) st.close(); }catch(SQLException ignore){}
  // closing the connection is what returns it to the c3p0 pool;
  // skipping this was the actual leak
  try{ if(con != null) con.close(); }catch(SQLException ignore){}
}
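Since Java 7, try-with-resources gives the same guarantee with less ceremony: each resource is closed automatically, in reverse order, even when the body throws, e.g. try (Connection con = DbConfig.openCons(); Statement st = con.createStatement()) { ... } (DbConfig.openCons() here is the helper from the question). The stand-in resource below just demonstrates that close() runs despite a failure:

```java
public class CloseOnError {

    // Minimal stand-in for a JDBC Statement or Connection:
    // it only records whether close() was called.
    static class TrackedResource implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // try-with-resources closes the resource even when the body throws,
    // which is exactly what the finally block above guarantees by hand.
    static boolean closedDespiteFailure() {
        TrackedResource res = new TrackedResource();
        try (TrackedResource r = res) {
            throw new IllegalStateException("simulated failed query");
        } catch (IllegalStateException e) {
            // in real JDBC code: log and/or rethrow as appropriate
        }
        return res.closed;
    }
}
```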