I'm trying to write a web crawler that grabs the synonyms for certain words from a thesaurus website and prints them to a text file. Seemingly at random, after crawling a few links, I get either a SocketTimeoutException or a 404 HttpStatusException.
For background: my code reads the URLs it crawls from a text file of links.
The pattern seems to be that if three or more URLs in succession point to words that aren't in the thesaurus, those exceptions are raised. And yes, I'm aware this could be avoided by simply removing the links whose words aren't in the thesaurus, but my URL list is considerably long, so manually checking which words are in the thesaurus and which aren't is out of the question.
import java.io.*;
import java.util.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;

public class ThesaurusSpider {

    private static final File urlList = new File("C:\\Users\\DaRkD0Ma1N\\Documents\\s.m.a.r.t\\generatedurls.txt");
    private static ArrayList<String> urlArrayList = new ArrayList<String>();

    // Reads one URL per line from the input file into the given list.
    public static void CreateUrlArray(ArrayList<String> urlArray, File urlList) throws FileNotFoundException {
        Scanner infile = new Scanner(urlList);
        while (infile.hasNextLine()) {
            urlArray.add(infile.nextLine());
        }
        infile.close();
    }

    public static void ExtractData(ArrayList<String> urlArrayList) throws IOException, InterruptedException {
        Document doc = Jsoup.connect("http://www.thesaurus.com/")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .referrer("http://www.google.com")
                .timeout(1000)
                .get();
        File synonyms = new File("C:\\Users\\DaRkD0Ma1N\\Documents\\s.m.a.r.t\\generated_syns.txt");
        PrintWriter pw = new PrintWriter(synonyms);
        String test = doc.title();
        System.out.println(test);
        int counter = 0;
        try {
            for (String url : urlArrayList) {
                if (counter == 30) {
                    // Pause briefly every 30 requests so the site isn't hammered.
                    Thread.sleep(500);
                    counter = 0;
                }
                Document word_doc = Jsoup.connect(url).get();
                // Skip pages that report no results for this word
                // (.no_results selects elements with class "no_results").
                if (word_doc.getElementById("words-gallery-no-results") != null
                        || word_doc.select(".no_results").hasText()) {
                    Thread.sleep(1000 * 2);
                    continue;
                }
                String[] title = word_doc.title().split(" ");
                System.out.println(title[0]);
                pw.write(title[0] + "\r\n");
                if (word_doc.getElementById("synonyms-0") != null) {
                    System.out.println(word_doc.select("em.txt").get(1).text());
                    pw.write(word_doc.select("em.txt").get(1).text() + "\r\n");
                    System.out.print(word_doc.select("span.text").text() + " ");
                    pw.write(word_doc.select("span.text").text() + " " + "\r\n");
                }
                System.out.println("");
                counter++;
            }
        } catch (HttpStatusException e) {
            e.printStackTrace();
        } finally {
            pw.close(); // flush and close the output file even on a normal run
        }
    }

    public static void PrintArrayList(ArrayList<String> list) {
        System.out.println(list);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        CreateUrlArray(urlArrayList, urlList);
        PrintArrayList(urlArrayList);
        ExtractData(urlArrayList);
    }
}
The links appear like this within the text file:
http://www.thesaurus.com/browse/Abby?s=t
http://www.thesaurus.com/browse/abdicate?s=t
" "
" "
" "
The links are a collection of words in alphabetical order; some of the words can be found in the thesaurus and some can't. I have a check in the loop that is supposed to catch and skip the links whose words aren't in the thesaurus, but apparently it isn't catching all of the bad links.
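One idea I've been toying with, based on the Jsoup docs, is to fetch each page with ignoreHttpErrors(true) and a longer timeout, so that a 404 shows up as a status code I can check instead of an exception, and a timeout gets retried instead of killing the whole run. A rough sketch of what I mean (fetchWithRetry and maxRetries are just names I made up, and I haven't verified this actually fixes my problem):

import java.io.IOException;
import java.net.SocketTimeoutException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchHelper {
    // Fetches a URL, retrying on timeouts and returning null for a 404,
    // so one bad link can't abort the whole crawl.
    static Document fetchWithRetry(String url, int maxRetries) throws IOException, InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                Connection.Response res = Jsoup.connect(url)
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                        .referrer("http://www.google.com")
                        .timeout(10000)           // 1000 ms looks too tight for this site
                        .ignoreHttpErrors(true)   // a 404 becomes a status code instead of an exception
                        .execute();
                if (res.statusCode() == 404) {
                    return null;                  // word not in the thesaurus -> caller skips it
                }
                return res.parse();
            } catch (SocketTimeoutException e) {
                Thread.sleep(2000);               // back off, then retry
            }
        }
        return null;                              // gave up after the retries timed out
    }
}

My thinking is that execute() hands back the response status before parsing, whereas get() throws as soon as the server returns a 404, which would explain why my "no results" check in the loop never gets a chance to run. I'm not sure this is the right way to go about it, though.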
I've kind of been banging my head against the wall on this one, so any help/suggestions are appreciated.