I'm working on web scraper and I can't solve problem I'm having for the second day in row.
The problem with this method is when the bot is supposed to visit the website, harvest all URL's, and add the ones of them it didn't visit already to List< String> "toVisit"
Problematic code:
Elements temp = userAgent.visit(currentUrl).findEvery("<a href>");
for (Element e : temp) {
String x = e.getAt("href");
if(!visited.contains(x)) {
toVisit.add(x);
}
}
However, the if statement doesn't filter (or filter it in way I didn't find out) url's and I have no idea why.
I tried delete the "!" in the statement and create an else part and paste toVisit.add(x) there, but it didn't help.
When I print every url, the bot visits the same ones two or even five times.
EDIT (visited defined)
static List<String> visited = new ArrayList<String>();
EDIT2 (whole code)
import java.util.ArrayList;
import java.util.List;
import com.jaunt.*;
public class b03 {
static String currentUrl = "https://stackoverflow.com";
static String stayAt = currentUrl;
static String searchingTerm = "";
static int toSearch = 50;
static List<String> toVisit = new ArrayList<String>();
static List<String> visited = new ArrayList<String>();
static UserAgent userAgent = new UserAgent();
public static void main(String[] args) {
System.out.println("*started searching...*");
while(visited.size() < toSearch)
visitUrl(currentUrl);
System.out.println("\n\n*done*\n\n");
}
public static void visitUrl(String url) {
visited.add(url);
evaluateUrls();
searchTerm();
toVisit.remove(0);
currentUrl = toVisit.get(0);
}
public static void searchTerm() {
//if(userAgent.doc.getTextContent().contains(searchingTerm))
System.out.println(visited.size() +") "+ currentUrl);
}
public static void evaluateUrls() {
try {
Elements temp = userAgent.visit(currentUrl).findEvery("<a href>");
for (Element e : temp) {
String x = e.getAt("href");
if(!visited.contains(x) && x.contains(stayAt)) {
toVisit.add(x);
}
}
}catch (Exception e) {
System.out.println(e);
}
}
}