
I'm working on a web scraper and I can't solve a problem I've been stuck on for the second day in a row.

The bot is supposed to visit a website, harvest all URLs, and add the ones it hasn't already visited to the List<String> toVisit — but this method doesn't work as expected.

Problematic code:

Elements temp = userAgent.visit(currentUrl).findEvery("<a href>");
for (Element e : temp) {
    String x = e.getAt("href");
    if (!visited.contains(x)) {
        toVisit.add(x);
    }
}

However, the if statement doesn't filter the URLs (or filters them in a way I haven't figured out), and I have no idea why.

I tried deleting the "!" in the condition and moving toVisit.add(x) into an else branch, but it didn't help.

When I print every URL, I see the bot visits the same ones two or even five times.

EDIT (visited defined)

static List<String> visited = new ArrayList<String>();

EDIT2 (whole code)

import java.util.ArrayList;
import java.util.List;
import com.jaunt.*;

public class b03 {

    static String currentUrl = "https://stackoverflow.com";
    static String stayAt = currentUrl;
    static String searchingTerm = "";
    static int toSearch = 50;

    static List<String> toVisit = new ArrayList<String>();
    static List<String> visited = new ArrayList<String>();

    static UserAgent userAgent = new UserAgent();   

    public static void main(String[] args) {
        System.out.println("*started searching...*");

        while(visited.size() < toSearch)
            visitUrl(currentUrl);

        System.out.println("\n\n*done*\n\n");
    }

    public static void visitUrl(String url) {
            visited.add(url);
            evaluateUrls();
            searchTerm();
            toVisit.remove(0);
            currentUrl = toVisit.get(0);
    }

    public static void searchTerm() {
        //if(userAgent.doc.getTextContent().contains(searchingTerm)) 
            System.out.println(visited.size() +") "+ currentUrl);
    }

    public static void evaluateUrls() {
        try {
            Elements temp = userAgent.visit(currentUrl).findEvery("<a href>");
            for (Element e : temp) {
                String x = e.getAt("href");
            if (!visited.contains(x) && x.contains(stayAt)) {
                    toVisit.add(x);
                }
            }
        }catch (Exception e) {
            System.out.println(e);
        }
    }
}
Ted Klein Bergman

2 Answers


Your bot visits the same URLs several times because you add them several times to the toVisit list.

To illustrate this: let's assume that the first few links your bot finds on the Stack Overflow site are the links to home (stackoverflow.com), tags (stackoverflow.com/tags), users (stackoverflow.com/users) and jobs (stackoverflow.jobs), and your bot adds three of those to the toVisit list.

Next it visits the tags page (stackoverflow.com/tags). This page again contains links to the same four URLs as before. Since you haven't yet visited the users and jobs subpages, it will add those to the toVisit list a second time.

To fix this, you should only add URLs to the toVisit list that are neither in the visited list nor already in the toVisit list:

        if (!visited.contains(x) && !toVisit.contains(x) && x.contains(stayAt)) { 
            toVisit.add(x);
        }
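A further option (my suggestion, not part of the answer above) is to track already-seen URLs in a HashSet, whose add() method already reports duplicates and avoids the linear contains() scans of an ArrayList:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class DedupDemo {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();         // every URL ever queued or visited
        Deque<String> toVisit = new ArrayDeque<>(); // queue preserving visiting order

        // The same link is usually discovered on several pages:
        String[] discovered = {
            "https://stackoverflow.com/tags",
            "https://stackoverflow.com/users",
            "https://stackoverflow.com/tags"   // duplicate found on another page
        };
        for (String url : discovered) {
            if (seen.add(url)) {   // add() returns false if already present
                toVisit.add(url);
            }
        }
        System.out.println(toVisit); // the duplicate /tags link is queued only once
    }
}
```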
Thomas Kläger

I can't try this code because I don't have the Jaunt library.

Split your code and make it readable. Avoid using static as much as possible.

Hope it helps.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import com.jaunt.*;

public class B03 {

    static String currentUrl = "https://stackoverflow.com";
    static String stayAt = currentUrl;
    static String searchingTerm = "";
    static int toSearch = 50;

    static List<String> toVisit = new ArrayList<String>();
    static List<String> visited = new ArrayList<String>();

    static UserAgent userAgent = new UserAgent();

    public static void main(String[] args) {
        System.out.println("*started searching...*");

        toVisit.add(currentUrl);

        while(toVisit.size() > 0 && visited.size() < toSearch){
            visitUrl(toVisit.get(0));
        }

        System.out.println("\n\n*done*\n\n");
    }

    public static void visitUrl(String url) {
        List<String> found = evaluateUrls(url);
        searchTerm(url);

        visited.add(url);
        toVisit.remove(url);
        // only queue URLs that are neither visited nor already queued
        toVisit.addAll(found.stream()
                .filter(e -> !visited.contains(e) && !toVisit.contains(e))
                .collect(Collectors.toList()));
    }

    public static void searchTerm(String currentUrl) {
        //if(userAgent.doc.getTextContent().contains(searchingTerm))
        System.out.println(visited.size() +") "+ currentUrl);
    }

    public static List<String> evaluateUrls(String currentUrl) {
        List<String> subUrls = new ArrayList<>();
        try {
            Elements temp = userAgent.visit(currentUrl).findEvery("<a href>");
            for (Element e : temp) {
                String x = e.getAt("href");
                subUrls.add(x);
            }
        }catch (Exception e) {
            System.out.println(e);
        }
        return subUrls;
    }
}
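Taken further, the static-free split suggested above could look like this sketch (the Crawler class and the linkExtractor callback, which stands in for the Jaunt call, are my own illustrative names):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Instance state instead of statics; link extraction is injected so the
// crawl logic can be tested without the Jaunt library.
public class Crawler {
    private final Set<String> visited = new LinkedHashSet<>();
    private final Deque<String> toVisit = new ArrayDeque<>();
    private final Function<String, List<String>> linkExtractor;
    private final int limit;

    public Crawler(Function<String, List<String>> linkExtractor, int limit) {
        this.linkExtractor = linkExtractor;
        this.limit = limit;
    }

    public Set<String> crawl(String start) {
        toVisit.add(start);
        while (!toVisit.isEmpty() && visited.size() < limit) {
            String url = toVisit.poll();
            if (!visited.add(url)) continue;      // skip already-visited URLs
            for (String link : linkExtractor.apply(url)) {
                if (!visited.contains(link)) {
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }
}
```

In real use the linkExtractor would wrap userAgent.visit(url).findEvery("<a href>"); in tests it can be backed by a plain map of fake pages.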
ggr