I'm continuing a project that has been running for a few years at my university. One of its tasks is to collect web pages from Google search results, using the Googlebot user agent.
For a reason I cannot figure out, the project no longer gets past this step. I have already researched a lot about what might be happening, including whether some part of the code is outdated.
The code is in Java and uses Maven for project management. I've tried updating some entries in Maven's pom, and I've also tried changing the part of the code that performs the search, but nothing works.
Here is the part of the code that isn't working as it should:
private List<JSONObject> querySearch(int numSeeds, String query) {
    List<JSONObject> result = new ArrayList<>();
    start = 0; // result offset; presumably declared elsewhere in the class
    do {
        // Build the search URL for the current page of results.
        String url = SEARCH_URL + query.replaceAll(" ", "+") + FILE_TYPE + "html" + START + start;
        Connection conn = Jsoup.connect(url).userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)").timeout(5000);
        try {
            Document doc = conn.get();
            result.addAll(formatter(doc));
        } catch (IOException e) {
            System.err.println("Could not search for seed pages in IO.");
            System.err.println(e);
        } catch (ParseException e) {
            System.err.println("Could not search for seed pages in Parse.");
            System.err.println(e);
        }
        start += 10; // move on to the next page of results
    } while (result.size() < numSeeds);
    return result;
}
The constants and fields used above:
private static final String SEARCH_URL = "https://www.google.com/search?q=";
private static final String FILE_TYPE = "&fileType=";
private static final String START = "&start=";
private QueryBuilder queryBuilder;

public GoogleAjaxSearch() {
    this.queryBuilder = new QueryBuilder();
}
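In case it matters, here is a minimal sketch of how the search URL could be assembled with the whole query percent-encoded, rather than only replacing spaces as querySearch does. The class name SearchUrlSketch and the helper buildSearchUrl are made up for illustration and are not part of the project; the constants are copied from above. It assumes Java 10 or later for URLEncoder.encode(String, Charset).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrlSketch {
    // Same values as the constants in the project.
    private static final String SEARCH_URL = "https://www.google.com/search?q=";
    private static final String FILE_TYPE = "&fileType=";
    private static final String START = "&start=";

    // Hypothetical helper: encodes the query so that characters such as '&'
    // cannot be mistaken for parameter separators, then appends the same
    // parameters that querySearch appends.
    static String buildSearchUrl(String query, int start) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return SEARCH_URL + encoded + FILE_TYPE + "html" + START + start;
    }

    public static void main(String[] args) {
        // prints https://www.google.com/search?q=open+data+%26+maps&fileType=html&start=10
        System.out.println(buildSearchUrl("open data & maps", 10));
    }
}
```

This only changes how the URL is built. I don't expect it to explain why the selector finds nothing.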
Up to this point everything works: the code connects using the Googlebot user agent and gets HTML back from Google. The problem is separating out what was found and keeping only the links, which should be matched by the selector "h3.r > a". This happens in formatter, which is called from result.addAll(formatter(doc)):
public List<JSONObject> formatter(Document doc) throws ParseException {
    List<JSONObject> entries = new ArrayList<>();
    Elements results = doc.select("h3.r > a");
    for (Element result : results) {
        //System.out.println(result.toString());
        JSONObject entry = new JSONObject();
        entry.put("url", (result.attr("href").substring(6, result.attr("href").indexOf("&")).substring(1)));
        entry.put("anchor", result.text());
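To make the substring logic on the "url" line easier to follow, here is a small standalone sketch of what I understand it to do: skip the first seven characters of the href and cut at the first "&". It assumes the href looks like "/url?q=<target>&...", which is what the offsets in formatter imply. HrefSketch and extractTarget are hypothetical names, and the guard clauses are additions of mine: the original line throws a StringIndexOutOfBoundsException when the href contains no "&".

```java
import java.util.Optional;

public class HrefSketch {
    // Assumed shape of a result link; seven characters long, which matches
    // substring(6, ...) followed by substring(1) in formatter.
    private static final String PREFIX = "/url?q=";

    // Hypothetical helper: returns the target URL, or empty when the href
    // does not have the expected prefix.
    static Optional<String> extractTarget(String href) {
        if (href == null || !href.startsWith(PREFIX)) {
            return Optional.empty();
        }
        int end = href.indexOf('&');
        String target = (end == -1)
                ? href.substring(PREFIX.length())        // no trailing parameters
                : href.substring(PREFIX.length(), end);  // cut at the first '&'
        return Optional.of(target);
    }

    public static void main(String[] args) {
        // prints https://example.com/page
        System.out.println(extractTarget("/url?q=https://example.com/page&sa=U").orElse("(no match)"));
        // prints (no match)
        System.out.println(extractTarget("https://example.com/page").orElse("(no match)"));
    }
}
```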
When execution reaches Elements results = doc.select("h3.r > a"), it probably finds no matching h3 elements, so the for loop is never entered and nothing is added to the entries list. Control then returns to querySearch, which tries again with the next offset, still without adding anything to result. Because result.size() never reaches numSeeds, the do-while loop never ends: it keeps requesting data that it never finds.
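To illustrate the loop problem in isolation, here is a minimal sketch with no network access. It is not the project's code: BoundedSearchSketch, collect, MAX_EMPTY_PAGES and the fetchPage parameter are all made up for this example, and fetchPage stands in for "download the page at this offset and run formatter on it". The sketch stops after a few consecutive empty pages, which is one possible way to avoid looping forever. It does not fix the selector itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class BoundedSearchSketch {
    // Hypothetical limit: give up after this many consecutive empty pages.
    static final int MAX_EMPTY_PAGES = 3;

    static List<String> collect(int numSeeds, IntFunction<List<String>> fetchPage) {
        List<String> result = new ArrayList<>();
        int start = 0;
        int emptyPages = 0;
        while (result.size() < numSeeds && emptyPages < MAX_EMPTY_PAGES) {
            List<String> page = fetchPage.apply(start);
            if (page.isEmpty()) {
                emptyPages++;          // nothing matched on this page
            } else {
                emptyPages = 0;        // progress was made, reset the counter
                result.addAll(page);
            }
            start += 10;               // same page step as querySearch
        }
        return result;
    }

    public static void main(String[] args) {
        // A source that never returns anything, like a selector that matches
        // no elements. Without the emptyPages check this loop would not end.
        List<String> none = collect(5, offset -> List.of());
        System.out.println(none.size()); // prints 0
    }
}
```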
If anyone here can help, I'd really appreciate it. I've been working on this for a while and I don't know what else to try. Thanks in advance.